Problem statement
=================

Calls to sched_yield() can touch data shared with other threads. Because of
this, userspace threads can generate high levels of contention by calling
sched_yield() in a tight loop from multiple cores.

For example, if cputimer is enabled for a process (e.g. through
setitimer(ITIMER_PROF, ...)), all threads of that process will do an atomic
add on the per-process field
p->signal->cputimer->cputime_atomic.sum_exec_runtime inside
account_group_exec_runtime(), which is called inside update_curr().

Currently, calling sched_yield() always calls update_curr() at least once in
schedule(), and potentially one more time in yield_task_fair(). Thus,
userspace threads can generate a lot of contention for the cacheline
containing cputime_atomic.sum_exec_runtime if multiple threads of a process
call sched_yield() in a tight loop.

At Google, we suspect that this contention led to a full machine lockup in
at least one instance, with ~50% of CPU cycles spent in the atomic add
inside account_group_exec_runtime() according to
`perf record -a -e cycles`.

Proposed solution
=================

To alleviate the contention, this patchset introduces the ability to limit
how frequently a thread is allowed to yield. It adds a new sched debugfs
knob called yield_interval_ns. A thread is allowed to yield at most once
every yield_interval_ns nanoseconds; subsequent calls to sched_yield()
within the interval simply return without calling schedule(). The default
value of the knob is 0, which disables throttling.

Performance
===========

To measure the impact on performance and contention, we used a benchmark
consisting of a process with a profiling timer enabled and N threads
sequentially assigned to logical cores, two threads per core. Each thread
calls sched_yield() in a tight loop. We measured the total number of
unthrottled sched_yield() calls made by all threads within a fixed time.

In addition, we recorded the benchmark runs with `perf record -a -g -e cycles`
and used the perf data to determine the percentage of CPU time spent in the
problematic atomic add instruction, which we treated as a measure of
contention. We ran the benchmark on an Intel Emerald Rapids CPU with 60
physical cores.

With throttling disabled, there was no measurable performance impact on
sched_yield(). Setting the interval to 1ns, which enables the throttling
code but doesn't actually throttle any calls to sched_yield(), results in a
1-3% penalty for sched_yield() at low thread counts; the penalty disappears
quickly as the thread count grows and contention becomes more of a factor.

With throttling disabled, CPU time spent in the atomic add instruction for
N=80 threads is roughly 80%. Setting yield_interval_ns to 10000 decreases
that percentage to 1-2%, but the total number of unthrottled sched_yield()
calls also decreases by ~60%.

Alternatives considered
=======================

An alternative we considered was to make the cputime accounting more
scalable by accumulating a thread's cputime locally in task_struct and
flushing it to the process-wide cputime when it reaches some threshold
value or when the thread is taken off the CPU. However, we determined that
the implementation was too intrusive compared to the benefit it provided,
and it wouldn't address other potential points of contention on the
sched_yield() path.
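In pseudocode, the behaviour described under "Proposed solution" above
amounts to roughly the following check on the yield path. This is only an
illustrative sketch: the names yield_allowed(), yield_interval_ns and
p->last_yield_ns are placeholders, not necessarily what the patches use.

static bool yield_allowed(struct task_struct *p)
{
        /* Hypothetical debugfs knob; 0 (the default) disables throttling. */
        u64 interval = READ_ONCE(yield_interval_ns);
        u64 now = ktime_get_ns();

        if (!interval)
                return true;

        /* Yielded less than one interval ago: throttle this call. */
        if (now - p->last_yield_ns < interval)
                return false;

        p->last_yield_ns = now;
        return true;
}

When the check fails, sched_yield() returns to userspace without calling
schedule(), so update_curr() and the shared accounting it touches are
avoided for that call.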
Kuba Piecuch (3):
  sched: add bool return value to sched_class::yield_task()
  sched/fair: don't schedule() in yield if nr_running == 1
  sched/fair: add debugfs knob for yield throttling

 include/linux/sched.h    |  2 ++
 kernel/sched/core.c      |  1 +
 kernel/sched/deadline.c  |  4 +++-
 kernel/sched/debug.c     |  2 ++
 kernel/sched/ext.c       |  4 +++-
 kernel/sched/fair.c      | 35 +++++++++++++++++++++++++++++++++--
 kernel/sched/rt.c        |  3 ++-
 kernel/sched/sched.h     |  4 +++-
 kernel/sched/stop_task.c |  2 +-
 kernel/sched/syscalls.c  |  9 ++++++++-
 10 files changed, 58 insertions(+), 8 deletions(-)

--
2.51.0.rc0.155.g4a0f42376b-goog
On Fri, Aug 08, 2025 at 08:02:47PM +0000, Kuba Piecuch wrote:
> Problem statement
> =================
>
> Calls to sched_yield() can touch data shared with other threads.
> Because of this, userspace threads could generate high levels of contention
> by calling sched_yield() in a tight loop from multiple cores.
>
> For example, if cputimer is enabled for a process (e.g. through
> setitimer(ITIMER_PROF, ...)), all threads of that process
> will do an atomic add on the per-process field
> p->signal->cputimer->cputime_atomic.sum_exec_runtime inside
> account_group_exec_runtime(), which is called inside update_curr().
>
> Currently, calling sched_yield() will always call update_curr() at least
> once in schedule(), and potentially one more time in yield_task_fair().
> Thus, userspace threads can generate quite a lot of contention for the
> cacheline containing cputime_atomic.sum_exec_runtime if multiple threads of
> a process call sched_yield() in a tight loop.
>
> At Google, we suspect that this contention led to a full machine lockup in
> at least one instance, with ~50% of CPU cycles spent in the atomic add
> inside account_group_exec_runtime() according to
> `perf record -a -e cycles`.

I've gotta ask, WTH is your userspace calling yield() so much?
On Mon, Aug 11, 2025 at 10:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Aug 08, 2025 at 08:02:47PM +0000, Kuba Piecuch wrote:
> > Problem statement
> > =================
...
> > At Google, we suspect that this contention led to a full machine lockup in
> > at least one instance, with ~50% of CPU cycles spent in the atomic add
> > inside account_group_exec_runtime() according to
> > `perf record -a -e cycles`.
>
> I've gotta ask, WTH is your userspace calling yield() so much?

The code calling sched_yield() was in the wait loop for a spinlock. It
would repeatedly yield until the compare-and-swap instruction succeeded
in acquiring the lock. This code runs in the SIGPROF handler.
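Roughly, the pattern looked like this (a simplified reconstruction for
illustration, not the actual code):

#include <sched.h>
#include <stdatomic.h>

static atomic_int lock;                 /* 0 = free, 1 = held */

static void profile_lock(void)
{
        int expected = 0;

        /* Spin until the CAS succeeds, yielding after every failed attempt. */
        while (!atomic_compare_exchange_weak(&lock, &expected, 1)) {
                expected = 0;           /* CAS rewrites 'expected' on failure */
                sched_yield();
        }
}

With many threads handling SIGPROF at roughly the same time, every
contended acquisition turns into a burst of sched_yield() calls across
cores.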
On Mon, Aug 11, 2025 at 03:35:35PM +0200, Kuba Piecuch wrote:
> On Mon, Aug 11, 2025 at 10:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Aug 08, 2025 at 08:02:47PM +0000, Kuba Piecuch wrote:
...
> > I've gotta ask, WTH is your userspace calling yield() so much?
>
> The code calling sched_yield() was in the wait loop for a spinlock. It
> would repeatedly yield until the compare-and-swap instruction succeeded
> in acquiring the lock. This code runs in the SIGPROF handler.

Well, then don't do that... userspace spinlocks are terrible, and
bashing yield like that isn't helpful either.

Throttling yield seems like entirely the wrong thing to do. Yes, yield()
is poorly defined (strictly speaking UB for anything not FIFO/RR) but
making it actively worse doesn't seem helpful.

The whole itimer thing is not scalable -- blaming that on yield seems
hardly fair.

Why not use timer_create(), with CLOCK_THREAD_CPUTIME_ID and
SIGEV_SIGNAL instead?
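For reference, one way that suggestion could look in userspace is sketched
below (error handling omitted; on older glibc, link with -lrt). The helper
name and period handling are illustrative assumptions, not a prescribed API
usage beyond the timer_create()/timer_settime() calls themselves.

#include <signal.h>
#include <time.h>

/* period_ns must be < 1e9 here; use tv_sec as well for longer periods. */
static void arm_thread_prof_timer(long period_ns)
{
        struct sigevent sev = {
                .sigev_notify = SIGEV_SIGNAL,
                .sigev_signo  = SIGPROF,
        };
        struct itimerspec its = {
                .it_value.tv_nsec    = period_ns,
                .it_interval.tv_nsec = period_ns,
        };
        timer_t timerid;

        /* Per-thread CPU clock, so no process-wide cputimer has to run. */
        timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &timerid);
        timer_settime(timerid, 0, &its, NULL);
}

Each thread arms its own timer against its own CPU-time clock, instead of
every thread of the process feeding the shared ITIMER_PROF accounting.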
On Thu, Aug 14, 2025 at 4:53 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Aug 11, 2025 at 03:35:35PM +0200, Kuba Piecuch wrote:
> > On Mon, Aug 11, 2025 at 10:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
...
> > > I've gotta ask, WTH is your userspace calling yield() so much?
> >
> > The code calling sched_yield() was in the wait loop for a spinlock. It
> > would repeatedly yield until the compare-and-swap instruction succeeded
> > in acquiring the lock. This code runs in the SIGPROF handler.
>
> Well, then don't do that... userspace spinlocks are terrible, and
> bashing yield like that isn't helpful either.
>
> Throttling yield seems like entirely the wrong thing to do. Yes, yield()
> is poorly defined (strictly speaking UB for anything not FIFO/RR) but
> making it actively worse doesn't seem helpful.
>
> The whole itimer thing is not scalable -- blaming that on yield seems
> hardly fair.
>
> Why not use timer_create(), with CLOCK_THREAD_CPUTIME_ID and
> SIGEV_SIGNAL instead?

I agree that there are userspace changes we can make to reduce contention
and prevent future lockups. What that doesn't address is the potential for
userspace to trigger kernel lockups, maliciously or unintentionally, by
spamming yield(). This patch series introduces a way to reduce contention
and the risk of userspace-induced lockups regardless of userspace behavior
-- that's the value proposition.
On Tue, Aug 19, 2025 at 4:08 PM Kuba Piecuch <jpiecuch@google.com> wrote:
>
> On Thu, Aug 14, 2025 at 4:53 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Aug 11, 2025 at 03:35:35PM +0200, Kuba Piecuch wrote:
> > > On Mon, Aug 11, 2025 at 10:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
...
> > > The code calling sched_yield() was in the wait loop for a spinlock. It
> > > would repeatedly yield until the compare-and-swap instruction succeeded
> > > in acquiring the lock. This code runs in the SIGPROF handler.
> >
> > Well, then don't do that... userspace spinlocks are terrible, and
> > bashing yield like that isn't helpful either.
> >
> > Throttling yield seems like entirely the wrong thing to do. Yes, yield()
> > is poorly defined (strictly speaking UB for anything not FIFO/RR) but
> > making it actively worse doesn't seem helpful.
> >
> > The whole itimer thing is not scalable -- blaming that on yield seems
> > hardly fair.
> >
> > Why not use timer_create(), with CLOCK_THREAD_CPUTIME_ID and
> > SIGEV_SIGNAL instead?
>
> I agree that there are userspace changes we can make to reduce contention
> and prevent future lockups. What that doesn't address is the potential for
> userspace to trigger kernel lockups, maliciously or unintentionally, via
> spamming yield(). This patch series introduces a way to reduce contention
> and risk of userspace-induced lockups regardless of userspace behavior
> -- that's the value proposition.

At a more basic level, we need to agree that there's a kernel issue here
that should be resolved: userspace potentially being able to trigger a hard
lockup via suboptimal/inappropriate use of syscalls.

Not long ago, there was a similar issue involving getrusage() [1]: a
process with many threads was causing hard lockups when the threads were
calling getrusage() too frequently. You could've said "don't call
getrusage() so much", but that would be addressing a symptom, not the
cause.
Granted, the fix in that case [2] was more elegant and less hacky than what
I'm proposing here, but there are alternative approaches that we can
pursue. We just need to agree that there's a problem in the kernel that
needs to be solved.

[1]: https://lore.kernel.org/all/20240117192534.1327608-1-dylanbhatch@google.com/
[2]: https://lore.kernel.org/all/20240122155023.GA26169@redhat.com/
On Thu, 14 Aug 2025 16:53:08 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Aug 11, 2025 at 03:35:35PM +0200, Kuba Piecuch wrote:
> > On Mon, Aug 11, 2025 at 10:36 AM Peter Zijlstra <peterz@infradead.org> wrote:
...
> > The code calling sched_yield() was in the wait loop for a spinlock. It
> > would repeatedly yield until the compare-and-swap instruction succeeded
> > in acquiring the lock. This code runs in the SIGPROF handler.
>
> Well, then don't do that... userspace spinlocks are terrible, and
> bashing yield like that isn't helpful either.

All it takes is for the kernel to take a hardware interrupt while your
'spin lock' is held, and any other thread trying to acquire the lock will
sit at 100% cpu until all the interrupt work finishes. A typical ethernet
interrupt will schedule more work from a softint context; with non-threaded
napi you have to wait for that to finish. That can all take milliseconds.

The same is true for a futex-based lock - but at least the waiting threads
sleep.

Pretty much the only solution is to replace the userspace locks with atomic
operations (and hope the atomics make progress).

I'm pretty sure it only makes sense to have spin locks that disable
interrupts.

	David
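For comparison, a minimal futex-based wait could look like the sketch below
(a simplified, hypothetical example: real implementations track contention
so the uncontended unlock can skip the wake syscall, and usually spin
briefly before sleeping).

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int lock;                 /* 0 = free, 1 = held */

static void futex_lock(void)
{
        int expected = 0;

        while (!atomic_compare_exchange_weak(&lock, &expected, 1)) {
                /* Sleep in the kernel as long as the lock word is still 1. */
                syscall(SYS_futex, &lock, FUTEX_WAIT, 1, NULL, NULL, 0);
                expected = 0;
        }
}

static void futex_unlock(void)
{
        atomic_store(&lock, 0);
        /* Wake at most one waiter sleeping on the lock word. */
        syscall(SYS_futex, &lock, FUTEX_WAKE, 1, NULL, NULL, 0);
}

Contended waiters block in FUTEX_WAIT instead of burning CPU in a
sched_yield() loop while the holder is preempted or stuck behind interrupt
work.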