[PATCH v3 0/5] Defer throttle when task exits to user
Posted by Aaron Lu 2 months, 3 weeks ago
v3:
- Keep throttled cfs_rq's PELT clock running as long as it still has
  entities queued, as suggested by Benjamin Segall. I've folded this change
  into patch 3;
- Rebased on top of tip/sched/core, commit 2885daf47081
  ("lib/smp_processor_id: Make migration check unconditional of SMP").

Hi Prateek,
I've kept your Tested-by tag from v2 (thanks!) since I believe this PELT
clock change should not affect things much, but let me know if you think
that is not appropriate.

Tests I've done:
- Jan's rt deadlock reproducer[1]. Without this series, I saw RCU stalls
  within 2 minutes; with this series, I did not see any RCU stall after
  10 minutes.
- A stress test that puts a lot of pressure on the fork/exit path and
  cgroup_threadgroup_rwsem. Without this series, the test causes hung tasks
  in about 5 minutes; with this series, no problem was found after several
  hours. Songtang wrote this test script and I've used it to verify the
  patches, thanks Songtang.

[1]: https://lore.kernel.org/all/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/

v2:
- Re-org the patchset to use a single patch to implement throttle
  related changes, suggested by Chengming;
- Use check_cfs_rq_runtime()'s return value in pick_task_fair() to
  decide if throttle task work is needed instead of checking
  throttled_hierarchy(), suggested by Peter;
- Simplify throttle_count check in tg_throttle_down() and
  tg_unthrottle_up(), suggested by Peter;
- Add enqueue_throttled_task() to speed up enqueuing a throttled task to
  a throttled cfs_rq, suggested by Peter;
- Add the missing detach_task_cfs_rq() handling for throttled tasks that
  get migrated to a new rq, pointed out by Chengming;
- Remove cond_resched_tasks_rcu_qs() in throttle_cfs_rq_work() as
  cond_resched*() is going away, pointed out by Peter.
I hope I didn't miss any comments or suggestions on v1; if I did, please
kindly let me know, thanks!

Base: tip/sched/core commit dabe1be4e84c ("sched/smp: Use the SMP version
of double_rq_clock_clear_update()")

cover letter of v1:

This is a continuation of Valentin Schneider's work posted here:
Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/

Valentin has described the problem very well in the above link and I
quote:
"
CFS tasks can end up throttled while holding locks that other,
non-throttled tasks are blocking on.

For !PREEMPT_RT, this can be a source of latency due to the throttling
causing a resource acquisition denial.

For PREEMPT_RT, this is worse and can lead to a deadlock:
o A CFS task p0 gets throttled while holding read_lock(&lock)
o A task p1 blocks on write_lock(&lock), making further readers enter
the slowpath
o A ktimers or ksoftirqd task blocks on read_lock(&lock)

If the cfs_bandwidth.period_timer to replenish p0's runtime is enqueued
on the same CPU as one where ktimers/ksoftirqd is blocked on
read_lock(&lock), this creates a circular dependency.

This has been observed to happen with:
o fs/eventpoll.c::ep->lock
o net/netlink/af_netlink.c::nl_table_lock (after hand-fixing the above)
but can trigger with any rwlock that can be acquired in both process and
softirq contexts.

The linux-rt tree has had
  1ea50f9636f0 ("softirq: Use a dedicated thread for timer wakeups.")
which helped this scenario for non-rwlock locks by ensuring the throttled
task would get PI'd to FIFO1 (ktimers' default priority). Unfortunately,
rwlocks cannot sanely do PI as they allow multiple readers.
"

Jan Kiszka has posted a reproducer for this PREEMPT_RT problem:
https://lore.kernel.org/r/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/
and K Prateek Nayak has a detailed analysis of how the deadlock happens:
https://lore.kernel.org/r/e65a32af-271b-4de6-937a-1a1049bbf511@amd.com/

To fix this issue for PREEMPT_RT and improve the latency situation for
!PREEMPT_RT, change the throttle model to be task based, i.e. when a cfs_rq
is throttled, mark its throttled status but do not remove it from the CPU's
rq. Instead, when tasks belonging to this cfs_rq get picked, add a task work
to them so that they dequeue themselves when they return to user space. This
way, throttled tasks do not hold any kernel resources. When the cfs_rq gets
unthrottled, enqueue those throttled tasks back.
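
For illustration only, here is a rough sketch of that idea in kernel-style C.
The task_struct field and dequeue_task_for_throttle() below are made-up names
for readability and are not the series' actual code; task_work_add() with
TWA_RESUME is the existing kernel API for running a callback on a task's next
return to user mode.

static void throttle_cfs_rq_work(struct callback_head *work)
{
	struct task_struct *p =
		container_of(work, struct task_struct, sched_throttle_work);

	/*
	 * Runs on p's return-to-user path, where it holds no kernel locks
	 * or other kernel resources, so it is safe to dequeue it until its
	 * cfs_rq gets unthrottled.
	 */
	dequeue_task_for_throttle(p);
}

static void task_throttle_setup_work(struct task_struct *p)
{
	init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
	/* TWA_RESUME: run the callback when @p next returns to user mode. */
	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
}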

This new throttle model has consequences. For example, for a cfs_rq that has
3 tasks attached, when 2 tasks are throttled on their return-to-user path
while one task is still running in kernel mode, this cfs_rq is in a partially
throttled state:
- Should its PELT clock be frozen?
- Should this state be accounted into throttled_time?

For the PELT clock, I chose to keep the current behavior and freeze it at
the cfs_rq's throttle time. The assumption is that tasks running in kernel
mode should not last too long, so freezing the cfs_rq's PELT clock preserves
its load and its corresponding sched_entity's weight. Hopefully, this results
in a stable situation that lets the remaining running tasks quickly finish
their jobs in kernel mode.
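
For reference, "freezing" here just means the cfs_rq's PELT clock does not
advance while it is throttled. The existing helper in kernel/sched/pelt.h
(shown slightly simplified) already behaves this way by holding the clock at
the value captured when throttling started:

static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
	/* Throttled: report the clock captured at throttle time. */
	if (unlikely(cfs_rq->throttle_count))
		return cfs_rq->throttled_clock_pelt -
		       cfs_rq->throttled_clock_pelt_time;

	/* Otherwise advance normally, minus accumulated throttled time. */
	return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_pelt_time;
}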

For throttle time accounting, according to RFC v2's feedback, it is reworked
for a cfs_rq as follows (a rough sketch is shown after the list):
- start accounting when the first task in its hierarchy gets throttled;
- stop accounting on unthrottle.
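
The sketch below illustrates that rule only; the field and helper names are
made up for this example and are not the actual identifiers used in the
patches:

static void account_task_throttled(struct cfs_rq *cfs_rq)
{
	/* First task in the hierarchy throttled on return-to-user: open the window. */
	if (cfs_rq->nr_throttled_tasks++ == 0)
		cfs_rq->throttled_clock = rq_clock(rq_of(cfs_rq));
}

static void account_cfs_rq_unthrottled(struct cfs_rq *cfs_rq)
{
	/* Unthrottle: close the window and fold it into the total. */
	if (cfs_rq->throttled_clock) {
		cfs_rq->throttled_time +=
			rq_clock(rq_of(cfs_rq)) - cfs_rq->throttled_clock;
		cfs_rq->throttled_clock = 0;
	}
}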

There was also a concern about the increased duration of (un)throttle
operations raised for RFC v1. I've done some tests: with a 2000-cgroup/20K
runnable-task setup on a 2-socket/384-CPU AMD server, the longest duration of
distribute_cfs_runtime() is in the 2ms-4ms range. For details, please see:
https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
For the throttle path, with Chengming's suggestion to move the task work
setup from throttle time to pick time, it is not an issue anymore.

Aaron Lu (2):
  sched/fair: Task based throttle time accounting
  sched/fair: Get rid of throttled_lb_pair()

Valentin Schneider (3):
  sched/fair: Add related data structure for task based throttle
  sched/fair: Implement throttle task work and related helpers
  sched/fair: Switch to task based throttle model

 include/linux/sched.h |   5 +
 kernel/sched/core.c   |   3 +
 kernel/sched/fair.c   | 451 ++++++++++++++++++++++++------------------
 kernel/sched/pelt.h   |   4 +-
 kernel/sched/sched.h  |   7 +-
 5 files changed, 274 insertions(+), 196 deletions(-)

-- 
2.39.5
Re: [PATCH v3 0/5] Defer throttle when task exits to user
Posted by Matteo Martelli 2 months ago
Hi Aaron,

On Tue, 15 Jul 2025 15:16:53 +0800, Aaron Lu <ziqianlu@bytedance.com> wrote:
... ...

I encountered this issue on a test image with both the PREEMPT_RT and
CFS_BANDWIDTH kernel options enabled. The test image is based on
freedesktop-sdk (v24.08.10) [1] with custom system configurations on top,
and it was run on qemu x86_64 with 4 virtual CPU cores. One notable system
configuration is having most of the system services run in a systemd slice
restricted to a single CPU core (with the AllowedCPUs systemd option) and
using CFS throttling (with the CPUQuota systemd option). With this
configuration I encountered RCU stalls during boot, I think because of the
increased probability given by multiple processes being spawned
simultaneously on the same core. After the first RCU stall, the system
becomes unresponsive and successive RCU stalls are detected periodically.
This seems to match the deadlock situation described in your cover letter.
I could only reproduce the RCU stalls with the combination of both
PREEMPT_RT and CFS_BANDWIDTH enabled.

I had already tested this patch set at v2 (RFC) [2] on top of kernels v6.14
and v6.15. I've now retested it at v3 on top of kernel v6.16-rc7. With the
patch set applied, I could no longer reproduce RCU stalls in any case. More
specifically, in the last test I ran without the patch set applied, I could
reproduce 32 RCU stalls in 24 hours, about 1 or 2 every hour. In this test
the system rebooted just after the first RCU stall occurrence (through the
panic_on_rcu_stall=1 and panic=5 kernel cmdline arguments), or after 100
seconds if no RCU stall occurred. This means the system rebooted 854 times
in 24 hours (about 3.7% reproducibility). You can see two RCU stall
instances below. I could not reproduce any RCU stall with the same test
after applying the patch set, and I obtained similar results while testing
the patch set at v2 (RFC) [2]. Another possibly interesting note is that the
original custom configuration had the slice at CPUQuota=150%; I then
retested it with CPUQuota=80%. The issue was reproducible in both
configurations, notably even with CPUQuota=150%, which to my understanding
should correspond to no CFS throttling given the CPU affinity is set to one
core only.

I also ran some quick tests with stress-ng and the systemd CPUQuota
parameter to verify that CFS throttling was behaving as expected. See the
details below, after the RCU stall logs.

I hope this information is helpful, and I can provide additional details if
needed.

Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>

[1]: https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/releases/freedesktop-sdk-24.08.10
[2]: https://lore.kernel.org/all/20250409120746.635476-1-ziqianlu@bytedance.com/


- RCU stall instances:

...
[   40.083057] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   40.083067] rcu:     Tasks blocked on level-0 rcu_node (CPUs 0-3): P1075/4:b..l
[   40.083070] rcu:     (detected by 0, t=21002 jiffies, g=2637, q=547 ncpus=4)
[   40.083073] task:podman          state:R  running task     stack:13568 pid:1075  tgid:1062  ppid:1021   task_flags:0x40014c flags:0x00004002
[   40.083081] Call Trace:
[   40.083082]  <TASK>
[   40.083084]  __schedule+0x3d4/0xf10
[   40.083100]  preempt_schedule+0x2e/0x50
[   40.083102]  preempt_schedule_thunk+0x16/0x30
[   40.083107]  try_to_wake_up+0x2fc/0x630
[   40.083111]  ep_autoremove_wake_function+0xd/0x40
[   40.083115]  __wake_up_common+0x6d/0x90
[   40.083117]  __wake_up+0x2c/0x50
[   40.083119]  ep_poll_callback+0x17b/0x230
[   40.083121]  __wake_up_common+0x6d/0x90
[   40.083122]  __wake_up+0x2c/0x50
[   40.083123]  sock_def_wakeup+0x3a/0x40
[   40.083128]  unix_release_sock+0x2a7/0x4a0
[   40.083134]  unix_release+0x2d/0x40
[   40.083137]  __sock_release+0x44/0xb0
[   40.083141]  sock_close+0x13/0x20
[   40.083142]  __fput+0xe1/0x2a0
[   40.083146]  task_work_run+0x58/0x90
[   40.083149]  do_exit+0x270/0xac0
[   40.083152]  do_group_exit+0x2b/0xc0
[   40.083153]  __x64_sys_exit_group+0x13/0x20
[   40.083154]  x64_sys_call+0xfdb/0x14f0
[   40.083156]  do_syscall_64+0xa4/0x260
[   40.083160]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[   40.083165] RIP: 0033:0x48638b
[   40.083167] RSP: 002b:000000c00004fde0 EFLAGS: 00000216 ORIG_RAX: 00000000000000e7
[   40.083169] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000048638b
[   40.083171] RDX: 000000c00004fdb0 RSI: 0000000000000001 RDI: 0000000000000000
[   40.083171] RBP: 000000c00004fdf0 R08: 4bad2e33de989e9a R09: 0000000002d79c40
[   40.083173] R10: 000000c0005eaa08 R11: 0000000000000216 R12: 0000000000000000
[   40.083173] R13: 0000000000000001 R14: 000000c0000061c0 R15: 000000c0000f27e0
[   40.083175]  </TASK>
[   40.083176] rcu: rcu_preempt kthread timer wakeup didn't happen for 20975 jiffies! g2637 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[   40.083179] rcu:     Possible timer handling issue on cpu=0 timer-softirq=1708
[   40.083180] rcu: rcu_preempt kthread starved for 20978 jiffies! g2637 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[   40.083182] rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[   40.083183] rcu: RCU grace-period kthread stack dump:
[   40.083183] task:rcu_preempt     state:I stack:14800 pid:17    tgid:17    ppid:2      task_flags:0x208040 flags:0x00004000
[   40.083187] Call Trace:
[   40.083188]  <TASK>
[   40.083189]  __schedule+0x3d4/0xf10
[   40.083192]  schedule+0x22/0xd0
[   40.083194]  schedule_timeout+0x7e/0x100
[   40.083199]  ? __pfx_process_timeout+0x10/0x10
[   40.083202]  rcu_gp_fqs_loop+0x103/0x6b0
[   40.083206]  ? __pfx_rcu_gp_kthread+0x10/0x10
[   40.083207]  rcu_gp_kthread+0x191/0x230
[   40.083208]  kthread+0xf6/0x1f0
[   40.083210]  ? __pfx_kthread+0x10/0x10
[   40.083212]  ret_from_fork+0x80/0xd0
[   40.083215]  ? __pfx_kthread+0x10/0x10
[   40.083217]  ret_from_fork_asm+0x1a/0x30
[   40.083219]  </TASK>
[   40.083220] rcu: Stack dump where RCU GP kthread last ran:
[   40.083225] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.16.0-rc7 #1 PREEMPT_{RT,(full)}
[   40.083227] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[   40.083227] RIP: 0010:pv_native_safe_halt+0xf/0x20
[   40.083229] Code: 76 79 00 e9 83 f5 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 35 68 21 00 fb f4 <c3> cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90
[   40.083230] RSP: 0018:ffffffff9a003e80 EFLAGS: 00000216
[   40.083231] RAX: ffff9034a1221000 RBX: ffffffff9a018900 RCX: 0000000000000001
[   40.083231] RDX: 4000000000000000 RSI: 0000000000000000 RDI: 000000000007507c
[   40.083232] RBP: 0000000000000000 R08: 000000000007507c R09: ffff90343bc24d90
[   40.083233] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   40.083233] R13: 0000000000000000 R14: ffffffff9a018038 R15: 000000007e0c1000
[   40.083237] FS:  0000000000000000(0000) GS:ffff9034a1221000(0000) knlGS:0000000000000000
[   40.083237] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   40.083238] CR2: 00007f72d6c58000 CR3: 0000000007722000 CR4: 00000000000006f0
[   40.083239] Call Trace:
[   40.083239]  <TASK>
[   40.083240]  default_idle+0x9/0x10
[   40.083241]  default_idle_call+0x2b/0x100
[   40.083243]  do_idle+0x1d0/0x230
[   40.083244]  cpu_startup_entry+0x24/0x30
[   40.083245]  rest_init+0xbc/0xc0
[   40.083247]  start_kernel+0x6ca/0x6d0
[   40.083252]  x86_64_start_reservations+0x24/0x30
[   40.083255]  x86_64_start_kernel+0xc5/0xd0
[   40.083256]  common_startup_64+0x13e/0x148
[   40.083258]  </TASK>
[   40.083260] Kernel panic - not syncing: RCU Stall
[   40.083261] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.16.0-rc7 #1 PREEMPT_{RT,(full)}
[   40.083263] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[   40.083263] Call Trace:
[   40.083267]  <IRQ>
[   40.083268]  dump_stack_lvl+0x4d/0x70
[   40.083269]  panic+0x10a/0x2b9
[   40.083271]  ? try_to_wake_up+0x2f2/0x630
[   40.083273]  panic_on_rcu_stall.cold+0xc/0xc
[   40.083275]  rcu_sched_clock_irq.cold+0x15f/0x3db
[   40.083277]  ? __pfx_tick_nohz_handler+0x10/0x10
[   40.083279]  update_process_times+0x70/0xb0
[   40.083281]  tick_nohz_handler+0x8c/0x150
[   40.083284]  __hrtimer_run_queues+0x148/0x2e0
[   40.083292]  hrtimer_interrupt+0xf2/0x210
[   40.083294]  __sysvec_apic_timer_interrupt+0x53/0x100
[   40.083296]  sysvec_apic_timer_interrupt+0x66/0x80
[   40.083298]  </IRQ>
[   40.083298]  <TASK>
[   40.083299]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[   40.083300] RIP: 0010:pv_native_safe_halt+0xf/0x20
[   40.083301] Code: 76 79 00 e9 83 f5 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 35 68 21 00 fb f4 <c3> cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90
[   40.083302] RSP: 0018:ffffffff9a003e80 EFLAGS: 00000216
[   40.083303] RAX: ffff9034a1221000 RBX: ffffffff9a018900 RCX: 0000000000000001
[   40.083303] RDX: 4000000000000000 RSI: 0000000000000000 RDI: 000000000007507c
[   40.083304] RBP: 0000000000000000 R08: 000000000007507c R09: ffff90343bc24d90
[   40.083304] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   40.083305] R13: 0000000000000000 R14: ffffffff9a018038 R15: 000000007e0c1000
[   40.083306]  default_idle+0x9/0x10
[   40.083307]  default_idle_call+0x2b/0x100
[   40.083309]  do_idle+0x1d0/0x230
[   40.083310]  cpu_startup_entry+0x24/0x30
[   40.083311]  rest_init+0xbc/0xc0
[   40.083312]  start_kernel+0x6ca/0x6d0
[   40.083313]  x86_64_start_reservations+0x24/0x30
[   40.083315]  x86_64_start_kernel+0xc5/0xd0
[   40.083316]  common_startup_64+0x13e/0x148
[   40.083317]  </TASK>
[   40.083440] Kernel Offset: 0x17600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
...

...
[   40.057080] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   40.057091] rcu:     Tasks blocked on level-0 rcu_node (CPUs 0-3): P1035/4:b..l P981/4:b..l
[   40.057096] rcu:     (detected by 0, t=21002 jiffies, g=2965, q=707 ncpus=4)
[   40.057100] task:systemd         state:R  running task     stack:12856 pid:981   tgid:981   ppid:1      task_flags:0x400100 flags:0x0000
[   40.057109] Call Trace:
[   40.057110]  <TASK>
[   40.057114]  __schedule+0x3d4/0xf10
[   40.057132]  preempt_schedule+0x2e/0x50
[   40.057134]  preempt_schedule_thunk+0x16/0x30
[   40.057140]  try_to_wake_up+0x2fc/0x630
[   40.057145]  ep_autoremove_wake_function+0xd/0x40
[   40.057150]  __wake_up_common+0x6d/0x90
[   40.057152]  __wake_up_sync+0x33/0x50
[   40.057154]  ep_poll_callback+0xcd/0x230
[   40.057156]  __wake_up_common+0x6d/0x90
[   40.057158]  __wake_up_sync_key+0x3a/0x50
[   40.057160]  sock_def_readable+0x3d/0xb0
[   40.057166]  unix_dgram_sendmsg+0x454/0x800
[   40.057174]  ____sys_sendmsg+0x317/0x350
[   40.057179]  ___sys_sendmsg+0x94/0xe0
[   40.057182]  __sys_sendmsg+0x85/0xe0
[   40.057187]  do_syscall_64+0xa4/0x260
[   40.057192]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[   40.057197] RIP: 0033:0x7f356c320d94
[   40.057198] RSP: 002b:00007ffc4a68a438 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
[   40.057200] RAX: ffffffffffffffda RBX: 000000000000001f RCX: 00007f356c320d94
[   40.057202] RDX: 0000000000004000 RSI: 00007ffc4a68a490 RDI: 000000000000001f
[   40.057203] RBP: 00007ffc4a68a620 R08: 0000000000000080 R09: 0000000000000007
[   40.057204] R10: 00007ffc4a68a3f4 R11: 0000000000000202 R12: 0000000000000000
[   40.057205] R13: 00007ffc4a68a490 R14: 0000000000000000 R15: 0000000000000002
[   40.057207]  </TASK>
[   40.057207] task:(sd-close)      state:D stack:14576 pid:1035  tgid:1035  ppid:1      task_flags:0x40004c flags:0x00004002
[   40.057211] Call Trace:
[   40.057212]  <TASK>
[   40.057212]  __schedule+0x3d4/0xf10
[   40.057215]  schedule_rtlock+0x15/0x30
[   40.057218]  rtlock_slowlock_locked+0x314/0xea0
[   40.057224]  rt_spin_lock+0x79/0xd0
[   40.057226]  __wake_up+0x1a/0x50
[   40.057227]  ep_poll_callback+0x17b/0x230
[   40.057230]  __wake_up_common+0x6d/0x90
[   40.057232]  __wake_up+0x2c/0x50
[   40.057233]  __send_signal_locked+0x417/0x430
[   40.057237]  ? rt_spin_unlock+0x12/0x40
[   40.057239]  ? rt_spin_lock+0x33/0xd0
[   40.057242]  do_notify_parent+0x24a/0x2a0
[   40.057244]  do_exit+0x7cc/0xac0
[   40.057247]  __x64_sys_exit+0x16/0x20
[   40.057249]  x64_sys_call+0xfe9/0x14f0
[   40.057251]  do_syscall_64+0xa4/0x260
[   40.057253]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[   40.057254] RIP: 0033:0x7fb2358f97de
[   40.057255] RSP: 002b:00007fffd71b3000 EFLAGS: 00000246 ORIG_RAX: 000000000000003c
[   40.057257] RAX: ffffffffffffffda RBX: 00007fb235adb6e0 RCX: 00007fb2358f97de
[   40.057258] RDX: 000055a50e51881b RSI: 00007fb235d70293 RDI: 0000000000000000
[   40.057259] RBP: 0000000000000000 R08: 0000000000000007 R09: 0000000000000007
[   40.057260] R10: 00007fb2358f97c6 R11: 0000000000000246 R12: 0000000000000019
[   40.057261] R13: 0000000000000411 R14: 000000000000003d R15: 000055a05454c3a0
[   40.057263]  </TASK>
[   40.057264] rcu: rcu_preempt kthread timer wakeup didn't happen for 20963 jiffies! g2965 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[   40.057266] rcu:     Possible timer handling issue on cpu=0 timer-softirq=1618
[   40.057267] rcu: rcu_preempt kthread starved for 20966 jiffies! g2965 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[   40.057269] rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[   40.057269] rcu: RCU grace-period kthread stack dump:
[   40.057270] task:rcu_preempt     state:I stack:14800 pid:17    tgid:17    ppid:2      task_flags:0x208040 flags:0x00004000
[   40.057274] Call Trace:
[   40.057274]  <TASK>
[   40.057275]  __schedule+0x3d4/0xf10
[   40.057278]  schedule+0x22/0xd0
[   40.057280]  schedule_timeout+0x7e/0x100
[   40.057282]  ? __pfx_process_timeout+0x10/0x10
[   40.057285]  rcu_gp_fqs_loop+0x103/0x6b0
[   40.057291]  ? __pfx_rcu_gp_kthread+0x10/0x10
[   40.057292]  rcu_gp_kthread+0x191/0x230
[   40.057294]  kthread+0xf6/0x1f0
[   40.057296]  ? __pfx_kthread+0x10/0x10
[   40.057298]  ret_from_fork+0x80/0xd0
[   40.057303]  ? __pfx_kthread+0x10/0x10
[   40.057305]  ret_from_fork_asm+0x1a/0x30
[   40.057308]  </TASK>
[   40.057308] rcu: Stack dump where RCU GP kthread last ran:
[   40.057314] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.16.0-rc7 #1 PREEMPT_{RT,(full)}
[   40.057316] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[   40.057317] RIP: 0010:pv_native_safe_halt+0xf/0x20
[   40.057319] Code: 76 79 00 e9 83 f5 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 35 68 21 00 fb f4 <c3> cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90
[   40.057321] RSP: 0018:ffffffff97a03e80 EFLAGS: 00000212
[   40.057322] RAX: ffff9269a3821000 RBX: ffffffff97a18900 RCX: 0000000000000001
[   40.057323] RDX: 4000000000000000 RSI: 0000000000000000 RDI: 00000000000759bc
[   40.057324] RBP: 0000000000000000 R08: 00000000000759bc R09: ffff92693bc24d90
[   40.057325] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   40.057326] R13: 0000000000000000 R14: ffffffff97a18038 R15: 000000007e0c1000
[   40.057330] FS:  0000000000000000(0000) GS:ffff9269a3821000(0000) knlGS:0000000000000000
[   40.057331] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   40.057332] CR2: 00007f0546942030 CR3: 000000000746c000 CR4: 00000000000006f0
[   40.057333] Call Trace:
[   40.057334]  <TASK>
[   40.057334]  default_idle+0x9/0x10
[   40.057336]  default_idle_call+0x2b/0x100
[   40.057338]  do_idle+0x1d0/0x230
[   40.057341]  cpu_startup_entry+0x24/0x30
[   40.057342]  rest_init+0xbc/0xc0
[   40.057344]  start_kernel+0x6ca/0x6d0
[   40.057350]  x86_64_start_reservations+0x24/0x30
[   40.057354]  x86_64_start_kernel+0xc5/0xd0
[   40.057355]  common_startup_64+0x13e/0x148
[   40.057358]  </TASK>
[   40.057361] Kernel panic - not syncing: RCU Stall
[   40.057362] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.16.0-rc7 #1 PREEMPT_{RT,(full)}
[   40.057364] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[   40.057364] Call Trace:
[   40.057369]  <IRQ>
[   40.057370]  dump_stack_lvl+0x4d/0x70
[   40.057372]  panic+0x10a/0x2b9
[   40.057375]  ? try_to_wake_up+0x2f2/0x630
[   40.057377]  panic_on_rcu_stall.cold+0xc/0xc
[   40.057379]  rcu_sched_clock_irq.cold+0x15f/0x3db
[   40.057383]  ? __pfx_tick_nohz_handler+0x10/0x10
[   40.057385]  update_process_times+0x70/0xb0
[   40.057387]  tick_nohz_handler+0x8c/0x150
[   40.057391]  __hrtimer_run_queues+0x148/0x2e0
[   40.057394]  hrtimer_interrupt+0xf2/0x210
[   40.057397]  __sysvec_apic_timer_interrupt+0x53/0x100
[   40.057400]  sysvec_apic_timer_interrupt+0x66/0x80
[   40.057402]  </IRQ>
[   40.057403]  <TASK>
[   40.057403]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[   40.057405] RIP: 0010:pv_native_safe_halt+0xf/0x20
[   40.057407] Code: 76 79 00 e9 83 f5 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 35 68 21 00 fb f4 <c3> cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90
[   40.057408] RSP: 0018:ffffffff97a03e80 EFLAGS: 00000212
[   40.057409] RAX: ffff9269a3821000 RBX: ffffffff97a18900 RCX: 0000000000000001
[   40.057410] RDX: 4000000000000000 RSI: 0000000000000000 RDI: 00000000000759bc
[   40.057411] RBP: 0000000000000000 R08: 00000000000759bc R09: ffff92693bc24d90
[   40.057412] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   40.057413] R13: 0000000000000000 R14: ffffffff97a18038 R15: 000000007e0c1000
[   40.057415]  default_idle+0x9/0x10
[   40.057417]  default_idle_call+0x2b/0x100
[   40.057419]  do_idle+0x1d0/0x230
[   40.057420]  cpu_startup_entry+0x24/0x30
[   40.057422]  rest_init+0xbc/0xc0
[   40.057424]  start_kernel+0x6ca/0x6d0
[   40.057426]  x86_64_start_reservations+0x24/0x30
[   40.057428]  x86_64_start_kernel+0xc5/0xd0
[   40.057429]  common_startup_64+0x13e/0x148
[   40.057432]  </TASK>
[   40.057586] Kernel Offset: 0x15000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
...


- Tested stress-ng with 1 worker process with group CPU limit set to 20%.
stress-ng metrics showed 20% CPU usage for the worker process and top showed
20% CPU usage increase for CPU 0, where the worker process was running.

[root@localhost ~]# systemd-run -p CPUQuota=20% stress-ng --cpu 1 --timeout 10s --metrics
Running as unit: run-rb98f1ee55a4e4c9dacb29774213a399c.service; invocation ID: 77c93909960347e09e916fac907f87c6
[root@localhost ~]# journalctl -f -u run-rb98f1ee55a4e4c9dacb29774213a399c.service
Jul 28 16:19:31 localhost stress-ng[1310]: stress-ng: metrc: [1310] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
Jul 28 16:19:31 localhost stress-ng[1310]: stress-ng: metrc: [1310]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
Jul 28 16:19:31 localhost stress-ng[1310]: stress-ng: metrc: [1310] cpu                3181     10.02      2.01      0.01       317.50        1581.53        20.08          7504
Jul 28 16:19:31 localhost stress-ng[1310]: stress-ng: info:  [1310] skipped: 0
Jul 28 16:19:31 localhost stress-ng[1310]: stress-ng: info:  [1310] passed: 1: cpu (1)
Jul 28 16:19:31 localhost stress-ng[1310]: stress-ng: info:  [1310] failed: 0
Jul 28 16:19:31 localhost stress-ng[1310]: stress-ng: info:  [1310] metrics untrustworthy: 0
Jul 28 16:19:31 localhost stress-ng[1310]: stress-ng: info:  [1310] successful run completed in 10.02 secs
Jul 28 16:19:31 localhost systemd[1]: run-rb98f1ee55a4e4c9dacb29774213a399c.service: Deactivated successfully.
Jul 28 16:19:31 localhost systemd[1]: run-rb98f1ee55a4e4c9dacb29774213a399c.service: Consumed 2.026s CPU time.


- Tested stress-ng with 2 worker processes with group CPU limit set to 20%.
Both processes ran on the same CPU core due to the systemd slice CPU affinity
settings (AllowedCPUs=0). stress-ng metrics showed 10% CPU usage per worker
process and top showed 20% usage increase for CPU 0, where both worker
processes were running.

[root@localhost ~]# systemd-run -p CPUQuota=20% stress-ng --cpu 2 --timeout 10s --metrics
Running as unit: run-rd616594713434ac9bb346faa92f7110a.service; invocation ID: f45acc85d19944cbbdf633f0c95091bb
[root@localhost ~]# journalctl -f -u run-rd616594713434ac9bb346faa92f7110a.service
Jul 28 16:24:08 localhost stress-ng[1373]: stress-ng: metrc: [1373] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
Jul 28 16:24:08 localhost stress-ng[1373]: stress-ng: metrc: [1373]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
Jul 28 16:24:08 localhost stress-ng[1373]: stress-ng: metrc: [1373] cpu                3275     10.06      2.00      0.01       325.53        1630.74         9.98          7476
Jul 28 16:24:08 localhost stress-ng[1373]: stress-ng: info:  [1373] skipped: 0
Jul 28 16:24:08 localhost stress-ng[1373]: stress-ng: info:  [1373] passed: 2: cpu (2)
Jul 28 16:24:08 localhost stress-ng[1373]: stress-ng: info:  [1373] failed: 0
Jul 28 16:24:08 localhost stress-ng[1373]: stress-ng: info:  [1373] metrics untrustworthy: 0
Jul 28 16:24:08 localhost stress-ng[1373]: stress-ng: info:  [1373] successful run completed in 10.06 secs
Jul 28 16:24:08 localhost systemd[1]: run-rd616594713434ac9bb346faa92f7110a.service: Deactivated successfully.
Jul 28 16:24:08 localhost systemd[1]: run-rd616594713434ac9bb346faa92f7110a.service: Consumed 2.023s CPU time.


- Tested stress-ng with 3 worker processes with group CPU limit set to 60%.
This time without CPU affinity settings, so each process ran on a different CPU
core. stress-ng metrics showed 20% CPU usage per worker process and top showed
a 20% usage increase on each CPU.

[root@localhost ~]# systemd-run -p CPUQuota=60% stress-ng --cpu 3 --timeout 10s --metrics
Running as unit: run-r19417007568a4c55a02817588bd2b32f.service; invocation ID: c09e104a497c4fdfb467c6744bf3923b
[root@localhost ~]# journalctl -f -u run-r19417007568a4c55a02817588bd2b32f.service
Jul 28 16:55:46 localhost stress-ng[1386]: stress-ng: metrc: [1386] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
Jul 28 16:55:46 localhost stress-ng[1386]: stress-ng: metrc: [1386]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
Jul 28 16:55:46 localhost stress-ng[1386]: stress-ng: metrc: [1386] cpu                1974     10.08      6.04      0.01       195.93         326.04        20.03          6856
Jul 28 16:55:46 localhost stress-ng[1386]: stress-ng: info:  [1386] skipped: 0
Jul 28 16:55:46 localhost stress-ng[1386]: stress-ng: info:  [1386] passed: 3: cpu (3)
Jul 28 16:55:46 localhost stress-ng[1386]: stress-ng: info:  [1386] failed: 0
Jul 28 16:55:46 localhost stress-ng[1386]: stress-ng: info:  [1386] metrics untrustworthy: 0
Jul 28 16:55:46 localhost stress-ng[1386]: stress-ng: info:  [1386] successful run completed in 10.08 secs
Jul 28 16:55:46 localhost systemd[1]: run-r19417007568a4c55a02817588bd2b32f.service: Deactivated successfully.
Jul 28 16:55:46 localhost systemd[1]: run-r19417007568a4c55a02817588bd2b32f.service: Consumed 6.096s CPU time.


- Tested stress-ng with 4 worker processes with group CPU limit set to 40%.
Also this time without CPU affinity settings, so each process ran on a
different CPU core. stress-ng metrics showed 10% CPU usage per worker process
and top showed a 10% usage increase on each CPU.

[root@localhost ~]# systemd-run -p CPUQuota=40% stress-ng --cpu 4 --timeout 10s --metrics
Running as unit: run-r70a53f5333b948029f9739e80454648d.service; invocation ID: be510cc4c4e74676a9749c1758e65226
[root@localhost ~]# journalctl -f -u run-r70a53f5333b948029f9739e80454648d.service
Jul 28 16:58:33 localhost systemd[1]: Started /usr/bin/stress-ng --cpu 4 --timeout 10s --metrics.
Jul 28 16:58:33 localhost stress-ng[1420]: invoked with '/usr/bin/stress-ng --cpu 4 --timeout 10s --metrics' by user 0 'root'
Jul 28 16:58:33 localhost stress-ng[1420]: system: 'localhost' Linux 6.16.0-rc7-00005-g3113c41a2959 #1 SMP PREEMPT_RT Wed Jul 23 18:00:56 CEST 2025 x86_64
Jul 28 16:58:33 localhost stress-ng[1420]: stress-ng: info:  [1420] setting to a 10 secs run per stressor
Jul 28 16:58:33 localhost stress-ng[1420]: memory (MB): total 1973.16, free 1749.38, shared 10.99, buffer 7.56, swap 0.00, free swap 0.00
Jul 28 16:58:33 localhost stress-ng[1420]: stress-ng: info:  [1420] dispatching hogs: 4 cpu
Jul 28 16:58:43 localhost stress-ng[1420]: stress-ng: metrc: [1420] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
Jul 28 16:58:43 localhost stress-ng[1420]: stress-ng: metrc: [1420]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
Jul 28 16:58:43 localhost stress-ng[1420]: stress-ng: metrc: [1420] cpu                1062     10.08      4.01      0.02       105.41         263.63        10.00          7276
Jul 28 16:58:43 localhost stress-ng[1420]: stress-ng: info:  [1420] skipped: 0
Jul 28 16:58:43 localhost stress-ng[1420]: stress-ng: info:  [1420] passed: 4: cpu (4)
Jul 28 16:58:43 localhost stress-ng[1420]: stress-ng: info:  [1420] failed: 0
Jul 28 16:58:43 localhost stress-ng[1420]: stress-ng: info:  [1420] metrics untrustworthy: 0
Jul 28 16:58:43 localhost stress-ng[1420]: stress-ng: info:  [1420] successful run completed in 10.08 secs
Jul 28 16:58:43 localhost systemd[1]: run-r70a53f5333b948029f9739e80454648d.service: Deactivated successfully.
Jul 28 16:58:43 localhost systemd[1]: run-r70a53f5333b948029f9739e80454648d.service: Consumed 4.047s CPU time.


- Tested stress-ng with 4 worker processes with group CPU limit set to 200%.
Also this time without CPU affinity settings, so each process ran on a
different CPU core. stress-ng metrics showed 50% CPU usage per worker process
and top showed a 50% usage increase on each CPU.

[root@localhost ~]# systemd-run -p CPUQuota=200% stress-ng --cpu 4 --timeout 10s --metrics
Running as unit: run-r887083cd168e4b3fa07672b09c3bb72d.service; invocation ID: 224b6544b79e449db43b42455700fddd
[root@localhost ~]# journalctl -f -u run-r887083cd168e4b3fa07672b09c3bb72d.service
Jul 28 17:03:44 localhost systemd[1]: Started /usr/bin/stress-ng --cpu 4 --timeout 10s --metrics.
Jul 28 17:03:44 localhost stress-ng[1169]: invoked with '/usr/bin/stress-ng --cpu 4 --timeout 10s --metrics' by user 0 'root'
Jul 28 17:03:44 localhost stress-ng[1169]: system: 'localhost' Linux 6.16.0-rc7-00005-g3113c41a2959 #1 SMP PREEMPT_RT Wed Jul 23 18:00:56 CEST 2025 x86_64
Jul 28 17:03:44 localhost stress-ng[1169]: stress-ng: info:  [1169] setting to a 10 secs run per stressor
Jul 28 17:03:44 localhost stress-ng[1169]: memory (MB): total 1973.15, free 1722.16, shared 10.26, buffer 7.33, swap 0.00, free swap 0.00
Jul 28 17:03:44 localhost stress-ng[1169]: stress-ng: info:  [1169] dispatching hogs: 4 cpu
Jul 28 17:03:54 localhost stress-ng[1169]: stress-ng: metrc: [1169] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
Jul 28 17:03:54 localhost stress-ng[1169]: stress-ng: metrc: [1169]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
Jul 28 17:03:54 localhost stress-ng[1169]: stress-ng: metrc: [1169] cpu               21605     10.00     20.11      0.01      2160.25        1073.95        50.29          7424
Jul 28 17:03:54 localhost stress-ng[1169]: stress-ng: info:  [1169] skipped: 0
Jul 28 17:03:54 localhost stress-ng[1169]: stress-ng: info:  [1169] passed: 4: cpu (4)
Jul 28 17:03:54 localhost stress-ng[1169]: stress-ng: info:  [1169] failed: 0
Jul 28 17:03:54 localhost stress-ng[1169]: stress-ng: info:  [1169] metrics untrustworthy: 0
Jul 28 17:03:54 localhost stress-ng[1169]: stress-ng: info:  [1169] successful run completed in 10.00 secs
Jul 28 17:03:54 localhost systemd[1]: run-r887083cd168e4b3fa07672b09c3bb72d.service: Deactivated successfully.
Jul 28 17:03:54 localhost systemd[1]: run-r887083cd168e4b3fa07672b09c3bb72d.service: Consumed 20.133s CPU time.

Best regards,
Matteo Martelli
Re: [PATCH v3 0/5] Defer throttle when task exits to user
Posted by Aaron Lu 2 months ago
Hi Matteo,

On Fri, Aug 01, 2025 at 04:31:25PM +0200, Matteo Martelli wrote:
... ... 
> I encountered this issue on a test image with both PREEMPT_RT and
> CFS_BANDWIDTH kernel options enabled. The test image is based on
> freedesktop-sdk (v24.08.10) [1] with custom system configurations on
> top, and it was being run on qemu x86_64 with 4 virtual CPU cores. One
> notable system configuration is having most of system services running
> on a systemd slice, restricted on a single CPU core (with AllowedCPUs
> systemd option) and using CFS throttling (with CPUQuota systemd option).
> With this configuration I encountered RCU stalls during boots, I think
> because of the increased probability given by multiple processes being
> spawned simultaneously on the same core. After the first RCU stall, the
> system becomes unresponsive and successive RCU stalls are detected
> periodically. This seems to match with the deadlock situation described
> in your cover letter. I could only reproduce RCU stalls with the
> combination of both PREEMPT_RT and CFS_BANDWIDTH enabled.
> 
> I previously already tested this patch set at v2 (RFC) [2] on top of
> kernel v6.14 and v6.15. I've now retested it at v3 on top of kernel
> v6.16-rc7. I could no longer reproduce RCU stalls in all cases with the
> patch set applied. More specifically, in the last test I ran, without
> patch set applied, I could reproduce 32 RCU stalls in 24 hours, about 1
> or 2 every hour. In this test the system was rebooting just after the
> first RCU stall occurrence (through panic_on_rcu_stall=1 and panic=5
> kernel cmdline arguments) or after 100 seconds if no RCU stall occurred.
> This means the system rebooted 854 times in 24 hours (about 3.7%
> reproducibility). You can see below two RCU stall instances. I could not
> reproduce any RCU stall with the same test after applying the patch set.
> I obtained similar results while testing the patch set at v2 (RFC)[1].
> Another possibly interesting note is that the original custom
> configuration was with the slice CPUQuota=150%, then I retested it with
> CPUQuota=80%. The issue was reproducible in both configurations, notably
> even with CPUQuota=150% that to my understanding should correspond to no
> CFS throttling due to the CPU affinity set to 1 core only.

Agreed. With the CPU affinity set to 1 core, a 150% quota should never be
hit. But from the test results, it seems the quota is hit somehow, because
if the quota were not hit, this series should make no difference.

Maybe fire up a bpftrace script and see if the quota is actually hit? A
reference script is here:
https://lore.kernel.org/lkml/20250521115115.GB24746@bytedance/

> 
> I also ran some quick tests with stress-ng and systemd CPUQuota parameter to
> verify that CFS throttling was behaving as expected. See details below after
> RCU stall logs.

Thanks for all these tests. If I read them correctly, in all these
tests, CFS throttling worked as expected. Right?

> 
> I hope this is helpful information and I can provide additional details if
> needed.
> 

Yes it's very helpful.

> Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
> 

Thanks!

> [1]: https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/releases/freedesktop-sdk-24.08.10
> [2]: https://lore.kernel.org/all/20250409120746.635476-1-ziqianlu@bytedance.com/
> 

I'll rebase this series after the merge window for v6.17 closes; hopefully
it's in good shape and the maintainer will pick it up :)

Best regards,
Aaron
Re: [PATCH v3 0/5] Defer throttle when task exits to user
Posted by Valentin Schneider 2 months ago
On 04/08/25 15:52, Aaron Lu wrote:
> I'll rebase this series after merge window for v6.17 is closed and
> hopefully it's in good shape and maintainer will pick it up :)
>

FWIW I've had this buried in my todo list for too long; I'm bumping it up
and will do a proper review starting this week.
Re: [PATCH v3 0/5] Defer throttle when task exits to user
Posted by Aaron Lu 2 months ago
On Mon, Aug 04, 2025 at 01:18:05PM +0200, Valentin Schneider wrote:
> On 04/08/25 15:52, Aaron Lu wrote:
> > I'll rebase this series after merge window for v6.17 is closed and
> > hopefully it's in good shape and maintainer will pick it up :)
> >
> 
> FWIW I've had this buried in my todolist for too long, I'm bumping it up
> and will do a proper review starting this week.
>

Thanks Valentin.
It's great that you can look at this; I look forward to your comments.
Re: [PATCH v3 0/5] Defer throttle when task exits to user
Posted by Matteo Martelli 1 month, 4 weeks ago
Hi Aaron,

On Mon, 4 Aug 2025 15:52:04 +0800, Aaron Lu <ziqianlu@bytedance.com> wrote:
> Hi Matteo,
> 
> On Fri, Aug 01, 2025 at 04:31:25PM +0200, Matteo Martelli wrote:
> ... ... 
> > 
> > I previously already tested this patch set at v2 (RFC) [2] on top of
> > kernel v6.14 and v6.15. I've now retested it at v3 on top of kernel
> > v6.16-rc7. I could no longer reproduce RCU stalls in all cases with the
> > patch set applied. More specifically, in the last test I ran, without
> > patch set applied, I could reproduce 32 RCU stalls in 24 hours, about 1
> > or 2 every hour. In this test the system was rebooting just after the
> > first RCU stall occurrence (through panic_on_rcu_stall=1 and panic=5
> > kernel cmdline arguments) or after 100 seconds if no RCU stall occurred.
> > This means the system rebooted 854 times in 24 hours (about 3.7%
> > reproducibility). You can see below two RCU stall instances. I could not
> > reproduce any RCU stall with the same test after applying the patch set.
> > I obtained similar results while testing the patch set at v2 (RFC)[1].
> > Another possibly interesting note is that the original custom
> > configuration was with the slice CPUQuota=150%, then I retested it with
> > CPUQuota=80%. The issue was reproducible in both configurations, notably
> > even with CPUQuota=150% that to my understanding should correspond to no
> > CFS throttling due to the CPU affinity set to 1 core only.
> 
> Agree. With cpu affinity set to 1 core, 150% quota should never hit. But
> from the test results, it seems quota is hit somehow because if quota is
> not hit, this series should make no difference.
> 
> Maybe fire a bpftrace script and see if quota is actually hit? A
> reference script is here:
> https://lore.kernel.org/lkml/20250521115115.GB24746@bytedance/
> 

I looked into this more closely and there was actually another slice
(user.slice) configured with CPUQuota=25%. After disabling the CPUQuota
limit on the first mentioned slice (system.slice), I could still reproduce
the RCU stalls. It looks like the throttling was happening during the first
login after boot, as also shown by the following ftrace logs.

[   12.019263]  podman-user-gen-992     [000] dN.2.    12.023684: throttle_cfs_rq <-pick_task_fair
[   12.051074]          systemd-981     [000] dN.2.    12.055502: throttle_cfs_rq <-pick_task_fair
[   12.150067]          systemd-981     [000] dN.2.    12.154500: throttle_cfs_rq <-put_prev_entity
[   12.251448]          systemd-981     [000] dN.2.    12.255839: throttle_cfs_rq <-put_prev_entity
[   12.369867]     sshd-session-976     [000] dN.2.    12.374293: throttle_cfs_rq <-pick_task_fair
[   12.453080]             bash-1002    [000] dN.2.    12.457502: throttle_cfs_rq <-pick_task_fair
[   12.551279]             bash-1012    [000] dN.2.    12.555701: throttle_cfs_rq <-pick_task_fair
[   12.651085]           podman-998     [000] dN.2.    12.655505: throttle_cfs_rq <-pick_task_fair
[   12.750509]           podman-1001    [000] dN.2.    12.754931: throttle_cfs_rq <-put_prev_entity
[   12.868351]           podman-1030    [000] dN.2.    12.872780: throttle_cfs_rq <-put_prev_entity
[   12.961076]           podman-1033    [000] dN.2.    12.965504: throttle_cfs_rq <-put_prev_entity


By increasing the user.slice CPUQuota limit to 50%, the same test mentioned
in my previous email produced fewer RCU stalls and fewer throttling events
in the ftrace logs. Then, by setting the user.slice quota to 100%, I could
no longer reproduce either RCU stalls or traced throttling events.

> > I also ran some quick tests with stress-ng and systemd CPUQuota parameter to
> > verify that CFS throttling was behaving as expected. See details below after
> > RCU stall logs.
> 
> Thanks for all these tests. If I read them correctly, in all these
> tests, CFS throttling worked as expected. Right?
> 

Yes, correct.

> Best regards,
> Aaron
> 

Best regards,
Matteo Martelli
Re: [PATCH v3 0/5] Defer throttle when task exits to user
Posted by K Prateek Nayak 2 months ago
Hello Aaron,

On 7/15/2025 12:46 PM, Aaron Lu wrote:
> v3:
> - Keep throttled cfs_rq's PELT clock running as long as it still has
>   entity queued, suggested by Benjamin Segall. I've folded this change
>   into patch3;
> - Rebased on top of tip/sched/core, commit 2885daf47081
>   ("lib/smp_processor_id: Make migration check unconditional of SMP").
> 
> Hi Prateek,
> I've kept your tested-by tag(Thanks!) for v2 since I believe this pelt
> clock change should not affect things much, but let me know if you don't
> think that is appropriate.

I've officially tested this series so it should be fine :)

In addition to Jan's test, I also did some sanity tests looking at PELT
and everything looks good for the simplest case - one busy loop inside
a cgroup that gets throttled. The per-task throttling behavior is
identical to the current behavior for this simplest case.

If I find time, I'll look into nested hierarchies with wakeups to see
if I can spot anything odd there. I don't really have a good control
setup to compare against here but so far I haven't found anything odd
and it works as intended.

> 
> Tests I've done:
> - Jan's rt deadlock reproducer[1]. Without this series, I saw rcu-stalls
>   within 2 minutes and with this series, I do not see rcu-stalls after
>   10 minutes.
> - A stress test that creates a lot of pressure on fork/exit path and
>   cgroup_threadgroup_rwsem. Without this series, the test will cause
>   task hung in about 5 minutes and with this series, no problem found
>   after several hours. Songtang wrote this test script and I've used it
>   to verify the patches, thanks Songtang.

I just noticed this script. I'll give this a spin too when I test
nested hierarchies.

-- 
Thanks and Regards,
Prateek
Re: [PATCH v3 0/5] Defer throttle when task exits to user
Posted by Aaron Lu 2 months ago
On Mon, Aug 04, 2025 at 02:21:30PM +0530, K Prateek Nayak wrote:
> Hello Aaron,
> 
> On 7/15/2025 12:46 PM, Aaron Lu wrote:
> > v3:
> > - Keep throttled cfs_rq's PELT clock running as long as it still has
> >   entity queued, suggested by Benjamin Segall. I've folded this change
> >   into patch3;
> > - Rebased on top of tip/sched/core, commit 2885daf47081
> >   ("lib/smp_processor_id: Make migration check unconditional of SMP").
> > 
> > Hi Prateek,
> > I've kept your tested-by tag(Thanks!) for v2 since I believe this pelt
> > clock change should not affect things much, but let me know if you don't
> > think that is appropriate.
> 
> I've officially tested this series so it should be fine :)

Good to hear this :)

> 
> In addition to Jan's test, I also did some sanity test looking at PELT
> and everything looks good for the simplest case - once busy loop inside
> a cgroup that gets throttled. The per-task throttling behavior is
> identical to the current behavior for this simplest case.
> 
> If I find time, I'll look into nested hierarchies with wakeups to see
> if I can spot anything odd there. I don't really have a good control
> setup to compare against here but so far I haven't found anything odd
> and it works as intended.
> 

Thanks for all these tests.

Best regards,
Aaron

Re: [PATCH v3 0/5] Defer throttle when task exits to user
Posted by Valentin Schneider 1 month, 1 week ago
On 15/07/25 15:16, Aaron Lu wrote:
> There are consequences because of this new throttle model, e.g. for a
> cfs_rq that has 3 tasks attached, when 2 tasks are throttled on their
> return2user path, one task still running in kernel mode, this cfs_rq is
> in a partial throttled state:
> - Should its pelt clock be frozen?
> - Should this state be accounted into throttled_time?
>
> For pelt clock, I chose to keep the current behavior to freeze it on
> cfs_rq's throttle time. The assumption is that tasks running in kernel
> mode should not last too long, freezing the cfs_rq's pelt clock can keep
> its load and its corresponding sched_entity's weight. Hopefully, this can
> result in a stable situation for the remaining running tasks to quickly
> finish their jobs in kernel mode.

OK, I finally got to testing the PELT side of things :-)

I shoved a bunch of periodic tasks in a CPU cgroup with quite low limits
(1ms runtime, 10ms period); I looked at the _avg values using the
trace_pelt* tracepoints.

Overall there isn't much change to the averages themselves. There are more
updates since the tasks are genuinely dequeued/enqueued during a throttle
cycle, but that's expected.

I'll wait for your next version, but you can have:

Tested-by: Valentin Schneider <vschneid@redhat.com>
Re: [PATCH v3 0/5] Defer throttle when task exits to user
Posted by Aaron Lu 2 months, 3 weeks ago
On Tue, Jul 15, 2025 at 03:16:53PM +0800, Aaron Lu wrote:
> - A stress test that creates a lot of pressure on fork/exit path and
>   cgroup_threadgroup_rwsem. Without this series, the test will cause
>   task hung in about 5 minutes and with this series, no problem found
>   after several hours. Songtang wrote this test script and I've used it
>   to verify the patches, thanks Songtang.

Test scripts attached.