A user thread can get preempted in the middle of executing a critical
section in user space while holding locks, which can have an undesirable
effect on performance. Having a way for the thread to request additional
execution time on the cpu, so that it can complete the critical section,
is useful in such a scenario. The request can be made by setting a bit in
mapped memory that the kernel can also access, to check and grant extra
execution time on the cpu.

There have been a couple of proposals[1][2] for such a feature, which
attempt to address the above scenario by granting one extra tick of
execution time. In the patch thread [1] posted by Steven Rostedt, there is
ample discussion about the need for this feature. However, the concern has
been that this can lead to abuse; one extra tick can be a long time (about
a millisecond or more). Peter Zijlstra in response posted a prototype
solution[3], which grants only a 50us execution time extension. This is
achieved with the help of a timer started on that cpu at the time of
granting the extra execution time. When the timer fires, the thread will
be preempted if it is still running.

This patch set implements the above mentioned 50us extension time as
posted by Peter. But instead of using restartable sequences as the API to
set the flag requesting the extension, this patch set proposes a new API
based on a per thread shared structure, described below. This shared
structure is accessible from both user space and the kernel. The user
thread sets a flag in this shared structure to request an execution time
extension.

We tried this change with a database workload. Here are the results,
comparing runs with and without use of the 50us scheduler time slice
extension. This clearly shows there is a benefit.

Test results:
=============
Test system: 2 socket AMD Genoa

Lock table test: a simple database test to grab a table lock (spin lock).
Simulates sql query executions.
300 clients + 400 cpu hog tasks to generate load.

  Without extension : 182K SQL exec/sec
  With extension    : 262K SQL exec/sec
  44% improvement.

Swingbench - standard database benchmark
Cached (database files on tmpfs) run, with 1000 clients.

  Without extension : 99K SQL exec/sec
  With extension    : 153K SQL exec/sec
  55% improvement in throughput.

Shared structure mechanism:
==========================
A per thread structure is allocated from a page shared mapped between user
space and the kernel. This is useful for sharing thread specific
information between user space and the kernel without the need for system
calls in latency sensitive code paths.

Implementation:
A new system call is added so that a user thread can request use of a
shared structure. The kernel allocates page(s), shared mapped with user
space, in which the per-thread shared structures are allocated. These
structures are padded to 128 bytes. Multiple such shared structures are
allocated from each page (up to 32 per 4k page) to accommodate requests
from multiple threads in a process, thus avoiding the need to allocate one
page per thread. Additional pages are allocated as needed to accommodate
more threads requesting the shared structure; the number of pages required
depends on the number of threads of the process requesting/using the
shared structure. These pages are pinned, so the kernel can access/update
the shared structure through the kernel address without needing
copy_from_user()/copy_to_user() calls. The system call returns a pointer
(user address) to the per thread shared structure.
Application threads can save this per thread pointer in a TLS variable and
reference it. The scheduler time extension request described above is one
use case of this shared structure API: the user thread requests an
execution time extension by setting a flag in the shared structure.

Additional members can be appended to the shared structure to implement
new features that address other use cases. For example, sharing a thread's
time spent off cpu and on the run queue (described in [4]) would help the
user thread measure cpu time consumption across some operation, without
having to call the getrusage() system call or read /proc/pid/schedstat
frequently. Another use case is to share the user thread's state, 'on' or
'off' cpu, through the shared structure, which can be useful in
implementing adaptive waits in user space (as discussed in [1]): the
waiter thread checks whether the owner of the resource (lock) is 'on' cpu
before continuing to spin.

API:
====
The system call:

  int task_getshared(int option, int flags, void __user *uaddr)

Only the TASK_SHAREDINFO option is supported for now; 'flags' is not used.

  /* option */
  #define TASK_SHAREDINFO 1

  struct task_sharedinfo {
          volatile unsigned short sched_delay;
  };

  #define TASK_PREEMPT_DELAY_REQ     1
  #define TASK_PREEMPT_DELAY_GRANTED 2
  #define TASK_PREEMPT_DELAY_DENIED  3

The following call:

  __thread struct task_sharedinfo *ts;
  task_getshared(TASK_SHAREDINFO, 0, &ts);

The user task sets the 'sched_delay' member to TASK_PREEMPT_DELAY_REQ to
request a scheduler time extension. The kernel sets 'sched_delay' to
TASK_PREEMPT_DELAY_GRANTED if the request for an execution time extension
was granted, or to TASK_PREEMPT_DELAY_DENIED if it was denied.

[1] https://lore.kernel.org/lkml/20231025054219.1acaa3dd@gandalf.local.home/
[2] https://lore.kernel.org/lkml/1395767870-28053-1-git-send-email-khalid.aziz@oracle.com/
[3] https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net/
[4] https://lore.kernel.org/all/1631147036-13597-1-git-send-email-prakash.sangappa@oracle.com/

Prakash Sangappa (4):
  Introduce per thread user-kernel shared structure
  Scheduler time extension
  Indicate if scheduler preemption delay request is granted
  Add scheduler preemption delay granted stats

 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/entry-common.h           |  10 +-
 include/linux/mm_types.h               |   4 +
 include/linux/sched.h                  |  30 ++
 include/linux/syscalls.h               |   2 +
 include/linux/task_shared.h            |  63 +++++
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/task_shared.h       |  29 ++
 init/Kconfig                           |  10 +
 kernel/entry/common.c                  |  15 +-
 kernel/fork.c                          |  12 +
 kernel/sched/core.c                    |  28 ++
 kernel/sched/debug.c                   |   4 +
 kernel/sched/syscalls.c                |   7 +
 kernel/sys_ni.c                        |   2 +
 mm/Makefile                            |   1 +
 mm/mmap.c                              |  13 +
 mm/task_shared.c                       | 366 +++++++++++++++++++++++++
 19 files changed, 593 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/task_shared.h
 create mode 100644 include/uapi/linux/task_shared.h
 create mode 100644 mm/task_shared.c

-- 
2.43.5
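For reference, a minimal user space usage sketch of the API above. This is
only an illustration: the __NR_task_getshared number shown is a placeholder
(no libc wrapper exists, so a raw syscall(2) is assumed), and the constants
are taken from the new uapi header added by this series.

  #include <sched.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/task_shared.h>  /* TASK_SHAREDINFO, struct task_sharedinfo */

  #ifndef __NR_task_getshared
  #define __NR_task_getshared 463 /* placeholder syscall number */
  #endif

  static __thread struct task_sharedinfo *ts;

  static void thread_register(void)
  {
          /* Once per thread: the kernel returns a pointer into the shared page. */
          syscall(__NR_task_getshared, TASK_SHAREDINFO, 0, &ts);
  }

  static void locked_region(void)
  {
          /* Ask for up to 50us of extra time before being preempted. */
          ts->sched_delay = TASK_PREEMPT_DELAY_REQ;

          /* ... take the user space lock, run the critical section, unlock ... */

          /* Clear the request; if extra time was granted, yield the cpu. */
          if (__atomic_exchange_n(&ts->sched_delay, 0, __ATOMIC_RELAXED) ==
              TASK_PREEMPT_DELAY_GRANTED)
                  sched_yield();
  }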
On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:

> This patch set implements the above mentioned 50us extension time as posted
> by Peter. But instead of using restartable sequences as API to set the flag
> to request the extension, this patch proposes a new API with use of a per
> thread shared structure implementation described below. This shared structure
> is accessible in both users pace and kernel. The user thread will set the
> flag in this shared structure to request execution time extension.

But why -- we already have rseq, glibc uses it by default. Why add yet
another thing?
> On Nov 13, 2024, at 10:50 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>
>> This patch set implements the above mentioned 50us extension time as posted
>> by Peter. But instead of using restartable sequences as API to set the flag
>> to request the extension, this patch proposes a new API with use of a per
>> thread shared structure implementation described below. This shared structure
>> is accessible in both users pace and kernel. The user thread will set the
>> flag in this shared structure to request execution time extension.
>
> But why -- we already have rseq, glibc uses it by default. Why add yet
> another thing?

Mainly this provides pinned memory, which the kernel can access without the
concern of taking a page fault. There could be other use cases in the future
where updating user memory in kernel context cannot afford to take page
faults. For example, implementing a use case like providing the thread state
needs to be done in the context switch path when the thread is going off the
cpu.

-Prakash
On 2024-11-13 13:50, Peter Zijlstra wrote:
> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>
>> This patch set implements the above mentioned 50us extension time as posted
>> by Peter. But instead of using restartable sequences as API to set the flag
>> to request the extension, this patch proposes a new API with use of a per
>> thread shared structure implementation described below. This shared structure
>> is accessible in both users pace and kernel. The user thread will set the
>> flag in this shared structure to request execution time extension.
>
> But why -- we already have rseq, glibc uses it by default. Why add yet
> another thing?

Indeed, what I'm not seeing in this RFC patch series cover letter is an
explanation that justifies adding yet another per-thread memory area
shared between kernel and userspace when we have extensible rseq
already.

Peter, was there anything fundamentally wrong with your approach based
on rseq?
https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net

The main thing I wonder is whether loading the rseq delay resched flag
on return to userspace is too late in your patch. Also, I'm not sure it is
realistic to require that no system calls be done within the time extension
slice. If we have this scenario:

A) userspace grabs lock
   - set rseq delay resched flag
B) syscall
   - reschedule
   [...]
   - return to userspace, load rseq delay-resched flag from userspace (too late)

I would have thought loading the delay resched flag should be attempted much
earlier in the scheduler code. Perhaps we could do this from a page fault
disable critical section, and accept that this hint may be a no-op if the
rseq page happens to be swapped out (which is really unlikely). This is
similar to the "on_cpu" sched state rseq extension RFC I posted a while back,
which needed to be accessed from the scheduler:

https://lore.kernel.org/lkml/20230517152654.7193-1-mathieu.desnoyers@efficios.com/
https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/

And we'd leave the delay-resched load in place on return to userspace, so
in the unlikely scenario where it is swapped out, at least it gets paged
back in at that point.

Feel free to let me know if I'm missing an important point and/or saying
nonsense here.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Wed, Nov 13, 2024 at 02:36:58PM -0500, Mathieu Desnoyers wrote:
> On 2024-11-13 13:50, Peter Zijlstra wrote:
> > On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
> >
> > > This patch set implements the above mentioned 50us extension time as posted
> > > by Peter. But instead of using restartable sequences as API to set the flag
> > > to request the extension, this patch proposes a new API with use of a per
> > > thread shared structure implementation described below. This shared structure
> > > is accessible in both users pace and kernel. The user thread will set the
> > > flag in this shared structure to request execution time extension.
> >
> > But why -- we already have rseq, glibc uses it by default. Why add yet
> > another thing?
>
> Indeed, what I'm not seeing in this RFC patch series cover letter is an
> explanation that justifies adding yet another per-thread memory area
> shared between kernel and userspace when we have extensible rseq
> already.
>
> Peter, was there anything fundamentally wrong with your approach based
> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net

Not that I can remember, but it's a long time ago :-)

> The main thing I wonder is whether loading the rseq delay resched flag
> on return to userspace is too late in your patch.

Too late how? It only loads it at the point we would've called
schedule() -- no point in looking at it otherwise, right?

> Also, I'm not sure it is
> realistic to require that no system calls should be done within time extension
> slice. If we have this scenario:

Well, the whole premise is that syscalls are too expensive. If they are
not, then you shouldn't be using this in the first place.
On 2024-11-14 05:14, Peter Zijlstra wrote:
> On Wed, Nov 13, 2024 at 02:36:58PM -0500, Mathieu Desnoyers wrote:
>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>
>>>> This patch set implements the above mentioned 50us extension time as posted
>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>> to request the extension, this patch proposes a new API with use of a per
>>>> thread shared structure implementation described below. This shared structure
>>>> is accessible in both users pace and kernel. The user thread will set the
>>>> flag in this shared structure to request execution time extension.
>>>
>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>> another thing?
>>
>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>> explanation that justifies adding yet another per-thread memory area
>> shared between kernel and userspace when we have extensible rseq
>> already.
>>
>> Peter, was there anything fundamentally wrong with your approach based
>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>
> Not that I can remember, but it's a long time ago :-)
>
>> The main thing I wonder is whether loading the rseq delay resched flag
>> on return to userspace is too late in your patch.
>
> Too late how? It only loads it at the point we would've called
> schedule() -- no point in looking at it otherwise, right?

[...]

For the specific return-to-userspace path, I think where you've placed
the delay-resched flag check is fine.

I'm concerned about other code paths that invoke schedule() besides
return-to-userspace. For instance:

raw_irqentry_exit_cond_resched():

        if (!preempt_count()) {
                [...]
                if (need_resched())
                        preempt_schedule_irq();
        }

AFAIU, this could be triggered by an interrupt handler exit when nested
over a page fault handler, exception handler, or system call.

We may decide that we cannot care less about those scenarios, and just
ignore the delay-resched flag, but it's relevant to take those into
consideration and clearly document the rationale behind our decision.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
> On Nov 15, 2024, at 6:41 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> On 2024-11-14 05:14, Peter Zijlstra wrote:
>> On Wed, Nov 13, 2024 at 02:36:58PM -0500, Mathieu Desnoyers wrote:
>>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>>
>>>>> This patch set implements the above mentioned 50us extension time as posted
>>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>>> to request the extension, this patch proposes a new API with use of a per
>>>>> thread shared structure implementation described below. This shared structure
>>>>> is accessible in both users pace and kernel. The user thread will set the
>>>>> flag in this shared structure to request execution time extension.
>>>>
>>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>>> another thing?
>>>
>>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>>> explanation that justifies adding yet another per-thread memory area
>>> shared between kernel and userspace when we have extensible rseq
>>> already.
>>>
>>> Peter, was there anything fundamentally wrong with your approach based
>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>> Not that I can remember, but it's a long time ago :-)
>>> The main thing I wonder is whether loading the rseq delay resched flag
>>> on return to userspace is too late in your patch.
>> Too late how? It only loads it at the point we would've called
>> schedule() -- no point in looking at it otherwise, right?
>
> [...]
>
> For the specific return-to-userspace path, I think where you've placed
> the delay-resched flag check is fine.
>
> I'm concerned about other code paths that invoke schedule() besides
> return-to-userspace. For instance:
>
> raw_irqentry_exit_cond_resched():
>
>         if (!preempt_count()) {
>                 [...]
>                 if (need_resched())
>                         preempt_schedule_irq();
>         }
>
> AFAIU, this could be triggered by an interrupt handler exit when nested
> over a page fault handler, exception handler, or system call.
>
> We may decide that we cannot care less about those scenarios, and just
> ignore the delay-resched flag, but it's relevant to take those into
> consideration and clearly document the rationale behind our decision.

I don't think delay-resched will address all scenarios where preemption can
occur while in a critical section. We could aim to address the frequent
paths where a task can get preempted. Initially the intent was to prevent
preemption mainly at the end of the time slice, if the thread is in a
critical section in user space and has requested delaying the reschedule.

Another path to consider is a wakeup occurring on a different cpu, which
could enqueue a thread and attempt to preempt this thread while it is
running in the critical section. Should resched_curr() check whether the
running thread has been granted extra time, i.e. 'taskshrd_sched_delay' is
set, and avoid setting TIF_NEED_RESCHED and sending the IPI, like lazy
preemption? If 'taskshrd_sched_delay' is set, we know the thread will get
preempted by the timer soon, or the task will call sched_yield() (if it is
a well behaved application).

>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
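To make the suggestion above concrete, a rough sketch of what such a check
in resched_curr() could look like. This is an illustration only, not code
from the series: the 'taskshrd_sched_delay' field name is taken from the
discussion above, and whether TIF_NEED_RESCHED_LAZY can be used depends on
the lazy-preemption work.

  /* kernel/sched/core.c -- sketch only */
  void resched_curr(struct rq *rq)
  {
          struct task_struct *curr = rq->curr;

          /*
           * If the running task was already granted a slice extension, do
           * not force an immediate preemption with TIF_NEED_RESCHED plus an
           * IPI; mark the resched lazily and let the 50us extension timer
           * (or the task's own sched_yield()) end the extension.
           */
          if (curr->taskshrd_sched_delay) {
                  set_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY);
                  return;
          }

          /* ... existing resched_curr() logic: set TIF_NEED_RESCHED and kick the cpu ... */
  }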
> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> On 2024-11-13 13:50, Peter Zijlstra wrote:
>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>> This patch set implements the above mentioned 50us extension time as posted
>>> by Peter. But instead of using restartable sequences as API to set the flag
>>> to request the extension, this patch proposes a new API with use of a per
>>> thread shared structure implementation described below. This shared structure
>>> is accessible in both users pace and kernel. The user thread will set the
>>> flag in this shared structure to request execution time extension.
>> But why -- we already have rseq, glibc uses it by default. Why add yet
>> another thing?
>
> Indeed, what I'm not seeing in this RFC patch series cover letter is an
> explanation that justifies adding yet another per-thread memory area
> shared between kernel and userspace when we have extensible rseq
> already.

It mainly provides pinned memory, which can be useful for future use cases
where updating user memory from kernel context needs to be fast or has to
avoid page faults.

>
> Peter, was there anything fundamentally wrong with your approach based
> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>
> The main thing I wonder is whether loading the rseq delay resched flag
> on return to userspace is too late in your patch. Also, I'm not sure it is
> realistic to require that no system calls should be done within time extension
> slice. If we have this scenario:

I am also not sure if we need to prevent system calls in this scenario.
Was that restriction mainly because the restartable sequence API implements it?

-Prakash

>
> A) userspace grabs lock
> - set rseq delay resched flag
> B) syscall
> - reschedule
> [...]
> - return to userspace, load rseq delay-resched flag from userspace (too late)
>
> I would have thought loading the delay resched flag should be attempted much
> earlier in the scheduler code. Perhaps we could do this from a page fault
> disable critical section, and accept that this hint may be a no-op if the
> rseq page happens to be swapped out (which is really unlikely). This is
> similar to the "on_cpu" sched state rseq extension RFC I posted a while back,
> which needed to be accessed from the scheduler:
>
> https://lore.kernel.org/lkml/20230517152654.7193-1-mathieu.desnoyers@efficios.com/
> https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/
>
> And we'd leave the delay-resched load in place on return to userspace, so
> in the unlikely scenario where it is swapped out, at least it gets paged
> back at that point.
>
> Feel free to let me know if I'm missing an important point and/or saying
> nonsense here.
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
On Wed, Nov 13, 2024 at 08:10:52PM +0000, Prakash Sangappa wrote:
>
>
> > On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> >
> > On 2024-11-13 13:50, Peter Zijlstra wrote:
> >> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
> >>> This patch set implements the above mentioned 50us extension time as posted
> >>> by Peter. But instead of using restartable sequences as API to set the flag
> >>> to request the extension, this patch proposes a new API with use of a per
> >>> thread shared structure implementation described below. This shared structure
> >>> is accessible in both users pace and kernel. The user thread will set the
> >>> flag in this shared structure to request execution time extension.
> >> But why -- we already have rseq, glibc uses it by default. Why add yet
> >> another thing?
> >
> > Indeed, what I'm not seeing in this RFC patch series cover letter is an
> > explanation that justifies adding yet another per-thread memory area
> > shared between kernel and userspace when we have extensible rseq
> > already.
>
> It mainly provides pinned memory, can be useful for future use cases
> where updating user memory in kernel context can be fast or needs to
> avoid pagefaults.

'might be useful' is not good enough a justification. Also, I don't
think you actually need this.

See:

https://lkml.kernel.org/r/20220113233940.3608440-4-posk@google.com

for a more elaborate scheme.

> > Peter, was there anything fundamentally wrong with your approach based
> > on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
> >
> > The main thing I wonder is whether loading the rseq delay resched flag
> > on return to userspace is too late in your patch. Also, I'm not sure it is
> > realistic to require that no system calls should be done within time extension
> > slice. If we have this scenario:
>
> I am also not sure if we need to prevent system calls in this scenario.
> Was that restriction mainly because of restartable sequence API implements it?

No, the whole premise of delaying resched was because people think that
syscalls are too slow. If you do not think this, then you shouldn't be
using this.
> On Nov 14, 2024, at 2:28 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Nov 13, 2024 at 08:10:52PM +0000, Prakash Sangappa wrote: >> >> >>> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: >>> >>> On 2024-11-13 13:50, Peter Zijlstra wrote: >>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote: >>>>> This patch set implements the above mentioned 50us extension time as posted >>>>> by Peter. But instead of using restartable sequences as API to set the flag >>>>> to request the extension, this patch proposes a new API with use of a per >>>>> thread shared structure implementation described below. This shared structure >>>>> is accessible in both users pace and kernel. The user thread will set the >>>>> flag in this shared structure to request execution time extension. >>>> But why -- we already have rseq, glibc uses it by default. Why add yet >>>> another thing? >>> >>> Indeed, what I'm not seeing in this RFC patch series cover letter is an >>> explanation that justifies adding yet another per-thread memory area >>> shared between kernel and userspace when we have extensible rseq >>> already. >> >> It mainly provides pinned memory, can be useful for future use cases >> where updating user memory in kernel context can be fast or needs to >> avoid pagefaults. > > 'might be useful' it not good enough a justification. Also, I don't > think you actually need this. Will get back with database benchmark results using rseq API for scheduler time extension. > > See: > > https://lkml.kernel.org/r/20220113233940.3608440-4-posk@google.com > > for a more elaborate scheme. > >>> Peter, was there anything fundamentally wrong with your approach based >>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net >>> >>> The main thing I wonder is whether loading the rseq delay resched flag >>> on return to userspace is too late in your patch. Also, I'm not sure it is >>> realistic to require that no system calls should be done within time extension >>> slice. If we have this scenario: >> >> I am also not sure if we need to prevent system calls in this scenario. >> Was that restriction mainly because of restartable sequence API implements it? > > No, the whole premise of delaying resched was because people think that > syscalls are too slow. If you do not think this, then you shouldn't be > using this. Agree. Thanks, -Prakash
On 2024-11-14 14:42, Prakash Sangappa wrote: > > >> On Nov 14, 2024, at 2:28 AM, Peter Zijlstra <peterz@infradead.org> wrote: [...] >> >> See: >> >> https://lkml.kernel.org/r/20220113233940.3608440-4-posk@google.com >> >> for a more elaborate scheme. >> >>>> Peter, was there anything fundamentally wrong with your approach based >>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net >>>> >>>> The main thing I wonder is whether loading the rseq delay resched flag >>>> on return to userspace is too late in your patch. Also, I'm not sure it is >>>> realistic to require that no system calls should be done within time extension >>>> slice. If we have this scenario: >>> >>> I am also not sure if we need to prevent system calls in this scenario. >>> Was that restriction mainly because of restartable sequence API implements it? >> >> No, the whole premise of delaying resched was because people think that >> syscalls are too slow. If you do not think this, then you shouldn't be >> using this. > > Agree. I only partially agree with Peter here. I agree that we don't want to add system calls on the delay-resched critical section fast path, because this would have a significant performance hit. But there are scenarios where issuing system calls from within that critical section would be needed, even though those would not belong to the fast path: 1) If a signal handler nests over a delay-resched critical section. That signal handler is allowed to issue system calls. 2) If the critical section fast-path is calling GNU C library API and/or a vDSO, which is typically fast, but can end up calling a system call as fallback. e.g. clock_gettime, sched_getcpu. Preventing use of a system call by killing the application punches a hole in the abstractions meant to be provided by GNU libc and vDSO. I would recommend that we allow issuing system calls while the delay-resched bit is set. However, we may not strictly need to honor the delay-resched hint from a system call context, as those would be expected to be either infrequent or a portability fallback, which means the enhanced performance provided by delay-resched really won't matter. Another scenario to keep in mind are page faults happening within a delay-resched critical section. This is a scenario where page fault handling can explicitly reschedule. If this happens, I suspect we really don't care about the delay-resched hint, but we should consider whether this hint should be left as-is or cleared. Thoughts ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
> On Nov 15, 2024, at 6:13 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> On 2024-11-14 14:42, Prakash Sangappa wrote:
>>> On Nov 14, 2024, at 2:28 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> [...]
>
>>>
>>> See:
>>>
>>> https://lkml.kernel.org/r/20220113233940.3608440-4-posk@google.com
>>>
>>> for a more elaborate scheme.
>>>
>>>>> Peter, was there anything fundamentally wrong with your approach based
>>>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>>>>>
>>>>> The main thing I wonder is whether loading the rseq delay resched flag
>>>>> on return to userspace is too late in your patch. Also, I'm not sure it is
>>>>> realistic to require that no system calls should be done within time extension
>>>>> slice. If we have this scenario:
>>>>
>>>> I am also not sure if we need to prevent system calls in this scenario.
>>>> Was that restriction mainly because of restartable sequence API implements it?
>>>
>>> No, the whole premise of delaying resched was because people think that
>>> syscalls are too slow. If you do not think this, then you shouldn't be
>>> using this.
>> Agree.
>
> I only partially agree with Peter here. I agree that we don't want to
> add system calls on the delay-resched critical section fast path,
> because this would have a significant performance hit.
>
> But there are scenarios where issuing system calls from within that
> critical section would be needed, even though those would not belong
> to the fast path:
>
> 1) If a signal handler nests over a delay-resched critical section.
>    That signal handler is allowed to issue system calls.
>
> 2) If the critical section fast-path is calling GNU C library API and/or
>    a vDSO, which is typically fast, but can end up calling a system call
>    as fallback. e.g. clock_gettime, sched_getcpu. Preventing use of a
>    system call by killing the application punches a hole in the
>    abstractions meant to be provided by GNU libc and vDSO.
>
> I would recommend that we allow issuing system calls while the
> delay-resched bit is set. However, we may not strictly need to honor
> the delay-resched hint from a system call context, as those would
> be expected to be either infrequent or a portability fallback,
> which means the enhanced performance provided by delay-resched
> really won't matter.
>
> Another scenario to keep in mind are page faults happening within a
> delay-resched critical section. This is a scenario where page fault
> handling can explicitly reschedule. If this happens, I suspect we
> really don't care about the delay-resched hint, but we should consider
> whether this hint should be left as-is or cleared.
>

There are no explicit checks right now that a system call occurred in the
critical section, except under DEBUG_RSEQ, which probably will not apply
to the delay-resched case. But the question is whether such checks should
be implemented; probably not, given the scenarios you mentioned above.

> Thoughts ?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
On 2024-11-13 15:10, Prakash Sangappa wrote: > > >> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: >> >> On 2024-11-13 13:50, Peter Zijlstra wrote: >>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote: >>>> This patch set implements the above mentioned 50us extension time as posted >>>> by Peter. But instead of using restartable sequences as API to set the flag >>>> to request the extension, this patch proposes a new API with use of a per >>>> thread shared structure implementation described below. This shared structure >>>> is accessible in both users pace and kernel. The user thread will set the >>>> flag in this shared structure to request execution time extension. >>> But why -- we already have rseq, glibc uses it by default. Why add yet >>> another thing? >> >> Indeed, what I'm not seeing in this RFC patch series cover letter is an >> explanation that justifies adding yet another per-thread memory area >> shared between kernel and userspace when we have extensible rseq >> already. > > It mainly provides pinned memory, can be useful for future use cases where updating user memory in kernel context can be fast or needs to avoid pagefaults. Does the targeted use-case (scheduler time slice extension) require pinned memory, or just future use-cases ? Does having a missing time slice extension hint for a short while in case of high memory pressure (rseq page swapped out) have any measurable impact compared to the overhead of the page faults which will be happening in case of the high memory pressure required to trigger this scenario ? > >> >> Peter, was there anything fundamentally wrong with your approach based >> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net >> >> The main thing I wonder is whether loading the rseq delay resched flag >> on return to userspace is too late in your patch. Also, I'm not sure it is >> realistic to require that no system calls should be done within time extension >> slice. If we have this scenario: > > I am also not sure if we need to prevent system calls in this scenario. I suspect that if we prohibit system calls from being issued from a delay-resched userspace critical section, then loading the delay-resched rseq flag on return to userspace is always fine, because the kernel only reschedules on return from interrupt or trap. But I see this no-syscall restriction as being very cumbersome for userspace. > Was that restriction mainly because of restartable sequence API implements it? I suspect that this restriction is just to avoid loading the delay-resched flag from the scheduler when reschedule is called from an interrupt handler nested over a system call for preemptible kernels, but Peter could tell us more. One open question here is whether we want to pin memory for each thread in the system to hold this shared data between userspace and the scheduler. AFAIU, this is a trade-off between slice extension accuracy in high memory usage scenarios and pinned memory footprint impact. If the tradeoff goes towards making this memory pinned, then we may want to consider pinning the per-thread rseq area on rseq registration. Another option to consider is to use rseq to index a userspace per-cpu data structure, which will be used as shared memory between kernel and userspace. Userspace could store this delay-resched flag into the current CPU's shared data, and the scheduler could load it from there. 
If pinning per-cpu data is more acceptable than pinning per-thread data, then it could be an improvement. This could be a new ABI between kernel and userspace, e.g.: struct rseq_percpu_area { __u32 sched_state; /* includes time slice extension flag. */ char end[]; }; Registered to the kernel with the following parameters: - Address of rseq_percpu_area for CPU 0, - The stride of the per-cpu indexing (see librseq mempool per-cpu allocator [1]), - offsetof(struct rseq_percpu_area, end) to have the feature size for extensibility. Thanks, Mathieu [1] https://lpc.events/event/18/contributions/1720/attachments/1572/3268/presentation-lpc2024-rseq-mempool.pdf > > -Prakash > >> >> A) userspace grabs lock >> - set rseq delay resched flag >> B) syscall >> - reschedule >> [...] >> - return to userspace, load rseq delay-resched flag from userspace (too late) >> >> I would have thought loading the delay resched flag should be attempted much >> earlier in the scheduler code. Perhaps we could do this from a page fault >> disable critical section, and accept that this hint may be a no-op if the >> rseq page happens to be swapped out (which is really unlikely). This is >> similar to the "on_cpu" sched state rseq extension RFC I posted a while back, >> which needed to be accessed from the scheduler: >> >> https://lore.kernel.org/lkml/20230517152654.7193-1-mathieu.desnoyers@efficios.com/ >> https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/ >> >> And we'd leave the delay-resched load in place on return to userspace, so >> in the unlikely scenario where it is swapped out, at least it gets paged >> back at that point. >> >> Feel free to let me know if I'm missing an important point and/or saying >> nonsense here. >> >> Thanks, >> >> Mathieu >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> https://www.efficios.com >> > -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
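A tiny illustration of how userspace could index such a per-cpu area, under
the registration parameters sketched above. This is hypothetical, not an
existing ABI: the base and stride are whatever would have been registered,
and the cpu number would normally come from the rseq cpu_id field
(sched_getcpu() is used here only for brevity).

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stddef.h>
  #include <linux/types.h>

  struct rseq_percpu_area {
          __u32 sched_state;      /* includes time slice extension flag. */
          char end[];
  };

  static struct rseq_percpu_area *percpu_base;    /* address registered for CPU 0 */
  static size_t percpu_stride;                    /* stride of the per-cpu indexing */

  static struct rseq_percpu_area *this_cpu_area(void)
  {
          int cpu = sched_getcpu();

          return (struct rseq_percpu_area *)
                  ((char *)percpu_base + cpu * percpu_stride);
  }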
> On Nov 13, 2024, at 12:57 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> On 2024-11-13 15:10, Prakash Sangappa wrote:
>>> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>>
>>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>>> This patch set implements the above mentioned 50us extension time as posted
>>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>>> to request the extension, this patch proposes a new API with use of a per
>>>>> thread shared structure implementation described below. This shared structure
>>>>> is accessible in both users pace and kernel. The user thread will set the
>>>>> flag in this shared structure to request execution time extension.
>>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>>> another thing?
>>>
>>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>>> explanation that justifies adding yet another per-thread memory area
>>> shared between kernel and userspace when we have extensible rseq
>>> already.
>> It mainly provides pinned memory, can be useful for future use cases where updating user memory in kernel context can be fast or needs to avoid pagefaults.
>
> Does the targeted use-case (scheduler time slice extension) require
> pinned memory, or just future use-cases ?

Probably not for the sched time slice extension, although we have not run
the database workload using restartable sequences. We can try that and get
back. What about publishing the thread state in the shared structure? It
would require updating user memory in the context switch code path when the
thread is going off the cpu. Here having pinned memory would help. Thread
state is the other requirement we are interested in.

>
> Does having a missing time slice extension hint for a short while in
> case of high memory pressure (rseq page swapped out) have any measurable

Yes, but the restartable sequence based implementation does a
copy_from_user(), which would go through the page fault. It may be ok here.
Perhaps it would require disabling page faults in this case? If the page is
not present, then the hint would most likely not matter and the kernel can
skip it.

> impact compared to the overhead of the page faults which will be
> happening in case of the high memory pressure required to trigger this
> scenario ?

We would have to test that.

>
>>>
>>> Peter, was there anything fundamentally wrong with your approach based
>>> on rseq ? https://lore.kernel.org/lkml/20231030132949.GA38123@noisy.programming.kicks-ass.net
>>>
>>> The main thing I wonder is whether loading the rseq delay resched flag
>>> on return to userspace is too late in your patch. Also, I'm not sure it is
>>> realistic to require that no system calls should be done within time extension
>>> slice. If we have this scenario:
>> I am also not sure if we need to prevent system calls in this scenario.
>
> I suspect that if we prohibit system calls from being issued from a
> delay-resched userspace critical section, then loading the delay-resched
> rseq flag on return to userspace is always fine, because the kernel only
> reschedules on return from interrupt or trap.
>
> But I see this no-syscall restriction as being very cumbersome for
> userspace.
>
>> Was that restriction mainly because of restartable sequence API implements it?
>
> I suspect that this restriction is just to avoid loading the
> delay-resched flag from the scheduler when reschedule is called
> from an interrupt handler nested over a system call for preemptible
> kernels, but Peter could tell us more.

Hoping Peter can comment on this.

> One open question here is whether we want to pin memory for
> each thread in the system to hold this shared data between
> userspace and the scheduler. AFAIU, this is a trade-off between
> slice extension accuracy in high memory usage scenarios and
> pinned memory footprint impact. If the tradeoff goes towards
> making this memory pinned, then we may want to consider pinning
> the per-thread rseq area on rseq registration.
>
> Another option to consider is to use rseq to index a userspace
> per-cpu data structure, which will be used as shared memory
> between kernel and userspace. Userspace could store this
> delay-resched flag into the current CPU's shared data, and
> the scheduler could load it from there. If pinning per-cpu
> data is more acceptable than pinning per-thread data, then
> it could be an improvement.

Interesting. Having per cpu shared memory may not work for all use cases.

Thanks,
-Prakash

>
> This could be a new ABI between kernel and userspace, e.g.:
>
> struct rseq_percpu_area {
>         __u32 sched_state; /* includes time slice extension flag. */
>         char end[];
> };
>
> Registered to the kernel with the following parameters:
>
> - Address of rseq_percpu_area for CPU 0,
> - The stride of the per-cpu indexing (see librseq mempool per-cpu
>   allocator [1]),
> - offsetof(struct rseq_percpu_area, end) to have the feature size
>   for extensibility.
>
> Thanks,
>
> Mathieu
>
> [1] https://lpc.events/event/18/contributions/1720/attachments/1572/3268/presentation-lpc2024-rseq-mempool.pdf
>
>> -Prakash
>>>
>>> A) userspace grabs lock
>>> - set rseq delay resched flag
>>> B) syscall
>>> - reschedule
>>> [...]
>>> - return to userspace, load rseq delay-resched flag from userspace (too late)
>>>
>>> I would have thought loading the delay resched flag should be attempted much
>>> earlier in the scheduler code. Perhaps we could do this from a page fault
>>> disable critical section, and accept that this hint may be a no-op if the
>>> rseq page happens to be swapped out (which is really unlikely). This is
>>> similar to the "on_cpu" sched state rseq extension RFC I posted a while back,
>>> which needed to be accessed from the scheduler:
>>>
>>> https://lore.kernel.org/lkml/20230517152654.7193-1-mathieu.desnoyers@efficios.com/
>>> https://lore.kernel.org/lkml/20230529191416.53955-1-mathieu.desnoyers@efficios.com/
>>>
>>> And we'd leave the delay-resched load in place on return to userspace, so
>>> in the unlikely scenario where it is swapped out, at least it gets paged
>>> back at that point.
>>>
>>> Feel free to let me know if I'm missing an important point and/or saying
>>> nonsense here.
>>>
>>> Thanks,
>>>
>>> Mathieu
>>>
>>> --
>>> Mathieu Desnoyers
>>> EfficiOS Inc.
>>> https://www.efficios.com
>>>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
Hello Prakash, Few questions around the benchmarks. On 11/13/2024 5:31 AM, Prakash Sangappa wrote: > [..snip..] > > Test results: > ============= > Test system 2 socket AMD Genoa > > Lock table test:- a simple database test to grab table lock(spin lock). > Simulates sql query executions. > 300 clients + 400 cpu hog tasks to generate load. Have you tried running the 300 clients with a nice value of -20 and 400 CPU hogs with the default nice value / nice 19? Does that help this particular case? > > Without extension : 182K SQL exec/sec > With extension : 262K SQL exec/sec > 44% improvement. > > Swingbench - standard database benchmark > Cached(database files on tmpfs) run, with 1000 clients. In this case, how does the performance fare when running the clients under SCHED_BATCH? What does the "TASK_PREEMPT_DELAY_REQ" count vs "TASK_PREEMPT_DELAY_GRANTED" count look like for the benchmark run? I'm trying to understand what the performance looks like when using existing features that inhibit preemption vs putting forward the preemption when the userspace is holding a lock. Feel free to quote the latency comparisons too if using the existing features lead to unacceptable avg/tail latencies. > > Without extension : 99K SQL exec/sec > with extension : 153K SQL exec/sec > 55% improvement in throughput. > > [..snip..] > -- Thanks and Regards, Prateek
> On Nov 12, 2024, at 9:43 PM, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Prakash,
>
> Few questions around the benchmarks.
>
> On 11/13/2024 5:31 AM, Prakash Sangappa wrote:
>> [..snip..] Test results:
>> =============
>> Test system 2 socket AMD Genoa
>> Lock table test:- a simple database test to grab table lock(spin lock).
>> Simulates sql query executions.
>> 300 clients + 400 cpu hog tasks to generate load.
>
> Have you tried running the 300 clients with a nice value of -20 and 400
> CPU hogs with the default nice value / nice 19? Does that help this
> particular case?

We have not tried this with the database. Will have to try it.

>
>> Without extension : 182K SQL exec/sec
>> With extension : 262K SQL exec/sec
>> 44% improvement.
>> Swingbench - standard database benchmark
>> Cached(database files on tmpfs) run, with 1000 clients.
>
> In this case, how does the performance fare when running the clients
> under SCHED_BATCH? What does the "TASK_PREEMPT_DELAY_REQ" count vs
> "TASK_PREEMPT_DELAY_GRANTED" count look like for the benchmark run?

We have not tried SCHED_BATCH. With this run, there were on average about
166 TASK_PREEMPT_DELAY_GRANTED grants per task, collected from the
scheduler stats captured at the end of the run. The test runs for about
5 min. We don't have the count of how many times the preempt delay was
requested: if the task completes the critical section, it clears the
TASK_PREEMPT_DELAY_REQ flag, so the kernel would not see it in many cases,
as this may not be near the end of the time slice. We would have to
capture the count in the application.

>
> I'm trying to understand what the performance looks like when using
> existing features that inhibit preemption vs putting forward the
> preemption when the userspace is holding a lock. Feel free to quote
> the latency comparisons too if using the existing features lead to
> unacceptable avg/tail latencies.
>
>> Without extension : 99K SQL exec/sec
>> with extension : 153K SQL exec/sec
>> 55% improvement in throughput.
>> [..snip..]
>
> --
> Thanks and Regards,
> Prateek