This is continued work based on Valentin Schneider's posting here:
Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
Valentin has described the problem very well in the above link. We also
see task hung problems from time to time in our environment due to cfs
quota. It is mostly visible with rwsem: when a reader is throttled, a
writer comes in and has to wait, and the writer then makes all subsequent
readers wait, causing priority inversion or even a whole-system hang.
To improve this situation, change the throttle model to be task based,
i.e. when a cfs_rq is throttled, mark its throttled status but do not
remove it from the cpu's rq. Instead, for tasks that belong to this
cfs_rq, add a task work to them when they get picked, so that they can be
dequeued when they return to user space. In this way, throttled tasks
will not hold any kernel resources. When the cfs_rq gets unthrottled,
enqueue those throttled tasks back.
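To make the model concrete, a minimal sketch of the throttle side could
look like this. It is illustrative only and not the actual patch:
task_work_add() and TWA_RESUME are existing kernel APIs, while the other
helpers and the sched_throttle_work/throttle_node/throttled_limbo_list
names are placeholders:

/* Illustrative sketch of the task based throttle flow, not the real patch. */
static void throttle_cfs_rq_work(struct callback_head *work)
{
        struct task_struct *p = container_of(work, struct task_struct,
                                             sched_throttle_work);
        struct cfs_rq *cfs_rq = task_cfs_rq(p);

        /*
         * Runs on return to user: the task holds no kernel resources
         * here, so it is safe to dequeue it and park it on the limbo
         * list until its cfs_rq is unthrottled.
         */
        dequeue_task_for_throttle(p);
        list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
}

static void maybe_setup_throttle_work(struct task_struct *p)
{
        /*
         * Called when @p is picked: if its hierarchy is throttled, arm a
         * task work so the actual dequeue happens at the user boundary.
         */
        if (cfs_rq_throttled(task_cfs_rq(p)))
                task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
}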
There are consequences of this new throttle model, e.g. for a cfs_rq
that has 3 tasks attached, when 2 tasks are throttled on their
return-to-user path while one task is still running in kernel mode, this
cfs_rq is in a partially throttled state:
- Should its pelt clock be frozen?
- Should this state be accounted into throttled_time?
For the pelt clock, I chose to keep the current behavior and freeze it
at the cfs_rq's throttle time. The assumption is that tasks running in
kernel mode should not last too long; freezing the cfs_rq's pelt clock
keeps its load and its corresponding sched_entity's weight. Hopefully,
this results in a stable situation that lets the remaining running tasks
quickly finish their jobs in kernel mode.
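For reference, the existing freeze works by reporting a stopped clock to
PELT; roughly like the helper in kernel/sched/pelt.h (reproduced from
memory, so treat the exact field names and shape as approximate):

/* Roughly the existing helper in kernel/sched/pelt.h (from memory): */
static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
        /*
         * While throttled, report a frozen clock so PELT sees no elapsed
         * time and the cfs_rq's load and weight do not decay.
         */
        if (unlikely(cfs_rq->throttle_count))
                return cfs_rq->throttled_clock_pelt - cfs_rq->throttled_clock_pelt_time;

        return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_pelt_time;
}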
For throttle time accounting, I can see several possibilities:
- Similar to the current behavior: start accounting when the cfs_rq gets
throttled (if cfs_rq->nr_queued > 0) and stop accounting when the cfs_rq
gets unthrottled. This has one drawback: if this cfs_rq has one task
when it gets throttled and that task eventually blocks instead of
returning to user space, then this cfs_rq has no tasks on its throttled
list but the time is still accounted as throttled. Patch2 and patch3
implement this accounting (simple, fewer code changes).
- Start accounting when the throttled cfs_rq has at least one task on
its throttled list; stop accounting when it's unthrottled. This
over-accounts throttled time because the partially throttled state is
accounted.
- Start accounting when the throttled cfs_rq has no tasks left and its
throttled list is not empty; stop accounting when this cfs_rq is
unthrottled. This under-accounts throttled time because the partially
throttled state is not accounted. Patch7 implements this accounting.
I do not have a strong feeling about which accounting is best; it's open
for discussion.
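To make the comparison concrete, the condition under which each option
would start accounting looks roughly like this (illustrative only, not
part of the series; the helper and some field names are made up for the
comparison):

/* Illustrative comparison only; the helper and some fields are placeholders. */
static void maybe_start_throttle_accounting(struct rq *rq, struct cfs_rq *cfs_rq,
                                            int option)
{
        bool start = false;

        switch (option) {
        case 1: /* patch2/3: like today, as soon as the cfs_rq is throttled */
                start = cfs_rq->nr_queued > 0;
                break;
        case 2: /* once at least one task sits on the limbo list (over-accounts) */
                start = !list_empty(&cfs_rq->throttled_limbo_list);
                break;
        case 3: /* patch7: only when no task of this cfs_rq is left running (under-accounts) */
                start = !cfs_rq->h_nr_queued &&
                        !list_empty(&cfs_rq->throttled_limbo_list);
                break;
        }

        if (start && !cfs_rq->throttled_clock)
                cfs_rq->throttled_clock = rq_clock(rq);
}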
There is also the concern, raised in v1, about the increased duration of
(un)throttle operations. I've done some tests and, with a 2000 cgroups /
20K runnable tasks setup on a 2-socket/384-CPU AMD server, the longest
duration of distribute_cfs_runtime() is in the 2ms-4ms range. For details,
please see:
https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
For the throttle path, with Chengming's suggestion to move the "task work
setup" from throttle time to pick time, it's not an issue anymore.
Patches:
Patch1 is preparation work;
Patch2-3 provide the main functionality.
Patch2 deals with the throttle path: when a cfs_rq is to be throttled,
mark the throttled status for this cfs_rq, and when tasks in the
throttled hierarchy get picked, add a task work to them so that when
those tasks return to user space, the task work can throttle them by
dequeuing them, remembering this by adding each task to its cfs_rq's
limbo list;
Patch3 deals with the unthrottle path: when a cfs_rq is to be
unthrottled, enqueue back those tasks on the limbo list;
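The unthrottle side is then roughly the reverse: walk the limbo list and
put those tasks back. Again an illustrative sketch, not the actual patch;
enqueue_task_fair() is the real fair-class enqueue hook, but the flag and
the field names here are placeholders:

/* Illustrative sketch of the unthrottle side; not the actual patch. */
static void enqueue_back_limbo_tasks(struct rq *rq, struct cfs_rq *cfs_rq)
{
        struct task_struct *p, *tmp;

        list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list,
                                 throttle_node) {
                list_del_init(&p->throttle_node);
                /*
                 * Re-enqueue the task that the task work dequeued on its
                 * return-to-user path.
                 */
                enqueue_task_fair(rq, p, ENQUEUE_WAKEUP);
        }
}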
Patch4 deals with the dequeue path when a task changes group, sched class
etc. A task that is throttled is dequeued in fair, but task->on_rq is
still set, so when it changes task group or sched class, or has its
affinity setting changed, the core will first dequeue it. But since this
task is already dequeued in the fair class, this patch handles that
situation.
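For illustration, the situation patch4 handles in the fair dequeue path
looks roughly like this (task_is_throttled() and throttle_node are
placeholders, and the real patch has more to take care of, e.g. the
dequeue flags):

/* Illustrative sketch; not the actual patch. */
static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
        /*
         * The task work already dequeued @p, but p->on_rq is still set,
         * so core paths (group change, class change, affinity change)
         * call dequeue again. Only forget its limbo entry so it is not
         * enqueued twice on unthrottle.
         */
        if (task_is_throttled(p)) {
                list_del_init(&p->throttle_node);
                return true;
        }

        /* ... the normal dequeue path follows ... */
        return true;
}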
Patch5-6 are cleanups. Some code is obsolete after switching to the task
based throttle mechanism.
Patch7 implements an alternative accounting mechanism for task based
throttle.
Changes since v1:
- Move "add task work" from throttle time to pick time, suggested by
Chengming Zhou;
- Use scoped_guard() and cond_resched_tasks_rcu_qs() in
throttle_cfs_rq_work(), suggested by K Prateek Nayak;
- Remove now obsolete throttled_lb_pair(), suggested by K Prateek Nayak;
- Fix cfs_rq->runtime_remaining condition check in unthrottle_cfs_rq(),
suggested by K Prateek Nayak;
- Fix h_nr_runnable accounting for delayed dequeue case when task based
throttle is in use;
- Implemented an alternative way of throttle time accounting for
discussion purpose;
- Make !CONFIG_CFS_BANDWIDTH build.
I hope I didn't omit any of the feedback I've received, but feel free to
let me know if I did.
As in v1, all change logs are written by me and if they read badly,
that's my fault.
Comments are welcome.
Base commit: tip/sched/core, commit 6432e163ba1b("sched/isolation: Make
use of more than one housekeeping cpu").
Aaron Lu (4):
sched/fair: Take care of group/affinity/sched_class change for
throttled task
sched/fair: get rid of throttled_lb_pair()
sched/fair: fix h_nr_runnable accounting with per-task throttle
sched/fair: alternative way of accounting throttle time
Valentin Schneider (3):
sched/fair: Add related data structure for task based throttle
sched/fair: Handle throttle path for task based throttle
sched/fair: Handle unthrottle path for task based throttle
include/linux/sched.h | 4 +
kernel/sched/core.c | 3 +
kernel/sched/fair.c | 449 ++++++++++++++++++++++--------------------
kernel/sched/sched.h | 7 +
4 files changed, 248 insertions(+), 215 deletions(-)
--
2.39.5
Hi Aaron, Hi Valentin,

On Wed, 2025-04-09 at 20:07 +0800, Aaron Lu wrote:
> This is a continuous work based on Valentin Schneider's posting here:
> Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
> https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
>
> Valentin has described the problem very well in the above link. We also
> have task hung problem from time to time in our environment due to cfs quota.
> It is mostly visible with rwsem: when a reader is throttled, writer comes in
> and has to wait, the writer also makes all subsequent readers wait,
> causing problems of priority inversion or even whole system hung.

for testing purposes I backported this series to 6.14. We're currently
hunting for a sporadic bug with PREEMPT_RT enabled. We see RCU stalls
and complete system freezes after a couple of days with some container
workload deployed. See [1].

It's too early to report "success", but this series seems to fix the
circular dependency / system hang. Testing is still ongoing.

While backporting I noticed some minor code style "issues". I will post
them afterwards. Feel free to ignore...

Best regards,
Florian

[1] https://lore.kernel.org/linux-rt-users/20250409135720.YuroItHp@linutronix.de/T/#t
Hi Florian,

On Mon, Apr 14, 2025 at 10:54:48AM +0200, Florian Bezdeka wrote:
> Hi Aaron, Hi Valentin,
>
> On Wed, 2025-04-09 at 20:07 +0800, Aaron Lu wrote:
> > This is a continuous work based on Valentin Schneider's posting here:
> > Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
> > https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
> >
> > Valentin has described the problem very well in the above link. We also
> > have task hung problem from time to time in our environment due to cfs quota.
> > It is mostly visible with rwsem: when a reader is throttled, writer comes in
> > and has to wait, the writer also makes all subsequent readers wait,
> > causing problems of priority inversion or even whole system hung.
>
> for testing purposes I backported this series to 6.14. We're currently
> hunting for a sporadic bug with PREEMPT_RT enabled. We see RCU stalls
> and complete system freezes after a couple of days with some container
> workload deployed. See [1].

I tried to make a setup last week to reproduce the RT/cfs throttle
deadlock issue Valentin described but haven't succeeded yet...

> It's too early to report "success", but this series seems to fix the
> circular dependency / system hang. Testing is still ongoing.

Good to know this and thanks for giving it a try.

> While backporting I noticed some minor code style "issues". I will post
> them afterwards. Feel free to ignore...

Your comments are welcome.

Best regards,
Aaron

>
> [1] https://lore.kernel.org/linux-rt-users/20250409135720.YuroItHp@linutronix.de/T/#t
On 14.04.25 14:04, Aaron Lu wrote:
> Hi Florian,
>
> On Mon, Apr 14, 2025 at 10:54:48AM +0200, Florian Bezdeka wrote:
>> Hi Aaron, Hi Valentin,
>>
>> On Wed, 2025-04-09 at 20:07 +0800, Aaron Lu wrote:
>>> This is a continuous work based on Valentin Schneider's posting here:
>>> Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
>>> https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
>>>
>>> Valentin has described the problem very well in the above link. We also
>>> have task hung problem from time to time in our environment due to cfs quota.
>>> It is mostly visible with rwsem: when a reader is throttled, writer comes in
>>> and has to wait, the writer also makes all subsequent readers wait,
>>> causing problems of priority inversion or even whole system hung.
>>
>> for testing purposes I backported this series to 6.14. We're currently
>> hunting for a sporadic bug with PREEMPT_RT enabled. We see RCU stalls
>> and complete system freezes after a couple of days with some container
>> workload deployed. See [1].
>
> I tried to make a setup last week to reproduce the RT/cfs throttle
> deadlock issue Valentin described but haven't succeeded yet...
>

Attached the bits with which we succeeded, sometimes. Setup: Debian 12,
RT kernel, 2-4 cores VM, 1-5 instances of the test, 2 min - 2 h
patience. As we have to succeed with at least 3 race conditions in a
row, that is still not bad... But maybe someone has an idea how to
increase probabilities further.

Jan

--
Siemens AG, Foundational Technologies
Linux Expert Center
Hello Jan,

On 4/15/2025 10:59 AM, Jan Kiszka wrote:
> On 14.04.25 14:04, Aaron Lu wrote:
>> Hi Florian,
>>
>> On Mon, Apr 14, 2025 at 10:54:48AM +0200, Florian Bezdeka wrote:
>>> Hi Aaron, Hi Valentin,
>>>
>>> On Wed, 2025-04-09 at 20:07 +0800, Aaron Lu wrote:
>>>> This is a continuous work based on Valentin Schneider's posting here:
>>>> Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
>>>> https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
>>>>
>>>> Valentin has described the problem very well in the above link. We also
>>>> have task hung problem from time to time in our environment due to cfs quota.
>>>> It is mostly visible with rwsem: when a reader is throttled, writer comes in
>>>> and has to wait, the writer also makes all subsequent readers wait,
>>>> causing problems of priority inversion or even whole system hung.
>>>
>>> for testing purposes I backported this series to 6.14. We're currently
>>> hunting for a sporadic bug with PREEMPT_RT enabled. We see RCU stalls
>>> and complete system freezes after a couple of days with some container
>>> workload deployed. See [1].
>>
>> I tried to make a setup last week to reproduce the RT/cfs throttle
>> deadlock issue Valentin described but haven't succeeded yet...
>>
>
> Attached the bits with which we succeeded, sometimes. Setup: Debian 12,
> RT kernel, 2-4 cores VM, 1-5 instances of the test, 2 min - 2 h
> patience. As we have to succeed with at least 3 race conditions in a
> row, that is still not bad... But maybe someone has an idea how to
> increase probabilities further.

Looking at run.sh, there are only fair tasks with one of them being run
with cfs bandwidth constraints. Are you saying something goes wrong on
PREEMPT_RT as a result of using bandwidth control on fair tasks?

What exactly is the symptom you are observing? Does one of the assert()
trip during the run? Do you see a stall logged on dmesg? Can you provide
more information on what to expect in this 2min - 2hr window?

Additionally, do you have RT throttling enabled in your setup? Can long
running RT tasks starve fair tasks on your setup?

>
> Jan
>

--
Thanks and Regards,
Prateek
On 15.04.25 08:05, K Prateek Nayak wrote:
> Hello Jan,
>
> On 4/15/2025 10:59 AM, Jan Kiszka wrote:
>> On 14.04.25 14:04, Aaron Lu wrote:
>>> Hi Florian,
>>>
>>> On Mon, Apr 14, 2025 at 10:54:48AM +0200, Florian Bezdeka wrote:
>>>> Hi Aaron, Hi Valentin,
>>>>
>>>> On Wed, 2025-04-09 at 20:07 +0800, Aaron Lu wrote:
>>>>> This is a continuous work based on Valentin Schneider's posting here:
>>>>> Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to
>>>>> user entry
>>>>> https://lore.kernel.org/lkml/20240711130004.2157737-1-
>>>>> vschneid@redhat.com/
>>>>>
>>>>> Valentin has described the problem very well in the above link. We
>>>>> also
>>>>> have task hung problem from time to time in our environment due to
>>>>> cfs quota.
>>>>> It is mostly visible with rwsem: when a reader is throttled, writer
>>>>> comes in
>>>>> and has to wait, the writer also makes all subsequent readers wait,
>>>>> causing problems of priority inversion or even whole system hung.
>>>>
>>>> for testing purposes I backported this series to 6.14. We're currently
>>>> hunting for a sporadic bug with PREEMPT_RT enabled. We see RCU stalls
>>>> and complete system freezes after a couple of days with some container
>>>> workload deployed. See [1].
>>>
>>> I tried to make a setup last week to reproduce the RT/cfs throttle
>>> deadlock issue Valentin described but haven't succeeded yet...
>>>
>>
>> Attached the bits with which we succeeded, sometimes. Setup: Debian 12,
>> RT kernel, 2-4 cores VM, 1-5 instances of the test, 2 min - 2 h
>> patience. As we have to succeed with at least 3 race conditions in a
>> row, that is still not bad... But maybe someone has an idea how to
>> increase probabilities further.
>
> Looking at run.sh, there are only fair tasks with one of them being run
> with cfs bandwidth constraints. Are you saying something goes wrong on
> PREEMPT_RT as a result of using bandwidth control on fair tasks?
Yes, exactly. Also our in-field workload that triggers (most likely)
this issue is not using RT tasks itself. Only kernel threads are RT here.
>
> What exactly is the symptom you are observing? Does one of the assert()
> trip during the run? Do you see a stall logged on dmesg? Can you provide
> more information on what to expect in this 2min - 2hr window?
I've just lost my traces from yesterday ("you have 0 minutes to find a
power adapter"), but I got nice RCU stall warnings in the VM, including
backtraces from the involved tasks (minus the read-lock holder IIRC).
Maybe Florian can drop one of his dumps.
>
> Additionally, do you have RT throttling enabled in your setup? Can long
> running RT tasks starve fair tasks on your setup?
RT throttling is enabled (default settings) but was not kicking in - why
should it in that scenario? The only RT thread, ktimers, ran into the
held lock and stopped.
Jan
--
Siemens AG, Foundational Technologies
Linux Expert Center
(+ Sebastian)
Hello Jan,
On 4/15/2025 11:39 AM, Jan Kiszka wrote:
>>> Attached the bits with which we succeeded, sometimes. Setup: Debian 12,
>>> RT kernel, 2-4 cores VM, 1-5 instances of the test, 2 min - 2 h
>>> patience. As we have to succeed with at least 3 race conditions in a
>>> row, that is still not bad... But maybe someone has an idea how to
>>> increase probabilities further.
>>
>> Looking at run.sh, there are only fair tasks with one of them being run
>> with cfs bandwidth constraints. Are you saying something goes wrong on
>> PREEMPT_RT as a result of using bandwidth control on fair tasks?
>
> Yes, exactly. Also our in-field workload that triggers (most likely)
> this issue is not using RT tasks itself. Only kernel threads are RT here.
>
>>
>> What exactly is the symptom you are observing? Does one of the assert()
>> trip during the run? Do you see a stall logged on dmesg? Can you provide
>> more information on what to expect in this 2min - 2hr window?
>
> I've just lost my traces from yesterday ("you have 0 minutes to find a
> power adapter"), but I got nice RCU stall warnings in the VM, including
> backtraces from the involved tasks (minus the read-lock holder IIRC).
> Maybe Florian can drop one of his dumps.
So I ran your reproducer on a 2vCPU VM running v6.15-rc1 PREEMPT_RT
and I saw:
rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 0-...!: (15000 ticks this GP) idle=8a74/0/0x1 softirq=0/0 fqs=0
rcu: (t=15001 jiffies g=12713 q=24 ncpus=2)
rcu: rcu_preempt kthread timer wakeup didn't happen for 15000 jiffies! g12713 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
rcu: Possible timer handling issue on cpu=0 timer-softirq=17688
rcu: rcu_preempt kthread starved for 15001 jiffies! g12713 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
rcu: RCU grace-period kthread stack dump:
task:rcu_preempt state:I stack:0 pid:17 tgid:17 ppid:2 task_flags:0x208040 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x401/0x15a0
? srso_alias_return_thunk+0x5/0xfbef5
? lock_timer_base+0x77/0xb0
? srso_alias_return_thunk+0x5/0xfbef5
? __pfx_rcu_gp_kthread+0x10/0x10
schedule+0x27/0xd0
schedule_timeout+0x76/0x100
? __pfx_process_timeout+0x10/0x10
rcu_gp_fqs_loop+0x10a/0x4b0
rcu_gp_kthread+0xd3/0x160
kthread+0xff/0x210
? rt_spin_lock+0x3c/0xc0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x34/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.15.0-rc1-test-dirty #746 PREEMPT_{RT,(full)}
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
RIP: 0010:pv_native_safe_halt+0xf/0x20
Code: 22 df e9 1f 08 e5 fe 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 85 96 15 00 fb f4 <e9> f7 07 e5 fe 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
RSP: 0018:ffffffff95803e50 EFLAGS: 00000216
RAX: ffff8e2d61534000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000081f8a6c
RBP: ffffffff9581d280 R08: 0000000000000000 R09: ffff8e2cf7d32301
R10: ffff8e2be11ae5c8 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000147b0
FS: 0000000000000000(0000) GS:ffff8e2d61534000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055e77c3a5128 CR3: 000000010ff78003 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
default_idle+0x9/0x20
default_idle_call+0x30/0x100
do_idle+0x20f/0x250
? do_idle+0xb/0x250
cpu_startup_entry+0x29/0x30
rest_init+0xde/0x100
start_kernel+0x733/0xb20
? copy_bootdata+0x9/0xb0
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0xba/0x110
common_startup_64+0x13e/0x141
</TASK>
Is this in line with what you are seeing?
>
>>
>> Additionally, do you have RT throttling enabled in your setup? Can long
>> running RT tasks starve fair tasks on your setup?
>
> RT throttling is enabled (default settings) but was not kicking in - why
> should it in that scenario? The only RT thread, ktimers, ran into the
> held lock and stopped.
>
> Jan
>
--
Thanks and Regards,
Prateek
On 4/15/2025 2:15 PM, K Prateek Nayak wrote:
> (+ Sebastian)
>
> Hello Jan,
>
> On 4/15/2025 11:39 AM, Jan Kiszka wrote:
>>>> Attached the bits with which we succeeded, sometimes. Setup: Debian 12,
>>>> RT kernel, 2-4 cores VM, 1-5 instances of the test, 2 min - 2 h
To improve the reproducibility, I pinned the two tasks to the same CPU
as the bandwidth timer and I could hit this consistently within a few
minutes at the most.
>>>> patience. As we have to succeed with at least 3 race conditions in a
>>>> row, that is still not bad... But maybe someone has an idea how to
>>>> increase probabilities further.
>>>
>>> Looking at run.sh, there are only fair tasks with one of them being run
>>> with cfs bandwidth constraints. Are you saying something goes wrong on
>>> PREEMPT_RT as a result of using bandwidth control on fair tasks?
>>
>> Yes, exactly. Also our in-field workload that triggers (most likely)
>> this issue is not using RT tasks itself. Only kernel threads are RT here.
>>
>>>
>>> What exactly is the symptom you are observing? Does one of the assert()
>>> trip during the run? Do you see a stall logged on dmesg? Can you provide
>>> more information on what to expect in this 2min - 2hr window?
>>
>> I've just lost my traces from yesterday ("you have 0 minutes to find a
>> power adapter"), but I got nice RCU stall warnings in the VM, including
>> backtraces from the involved tasks (minus the read-lock holder IIRC).
>> Maybe Florian can drop one of his dumps.
>
> So I ran your reproducer on a 2vCPU VM running v6.15-rc1 PREEMPT_RT
> and I saw:
>
> rcu: INFO: rcu_preempt self-detected stall on CPU
> rcu: 0-...!: (15000 ticks this GP) idle=8a74/0/0x1 softirq=0/0 fqs=0
> rcu: (t=15001 jiffies g=12713 q=24 ncpus=2)
> rcu: rcu_preempt kthread timer wakeup didn't happen for 15000 jiffies! g12713 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> rcu: Possible timer handling issue on cpu=0 timer-softirq=17688
> rcu: rcu_preempt kthread starved for 15001 jiffies! g12713 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
> rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
> rcu: RCU grace-period kthread stack dump:
> task:rcu_preempt state:I stack:0 pid:17 tgid:17 ppid:2 task_flags:0x208040 flags:0x00004000
> Call Trace:
> <TASK>
> __schedule+0x401/0x15a0
> ? srso_alias_return_thunk+0x5/0xfbef5
> ? lock_timer_base+0x77/0xb0
> ? srso_alias_return_thunk+0x5/0xfbef5
> ? __pfx_rcu_gp_kthread+0x10/0x10
> schedule+0x27/0xd0
> schedule_timeout+0x76/0x100
> ? __pfx_process_timeout+0x10/0x10
> rcu_gp_fqs_loop+0x10a/0x4b0
> rcu_gp_kthread+0xd3/0x160
> kthread+0xff/0x210
> ? rt_spin_lock+0x3c/0xc0
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x34/0x50
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.15.0-rc1-test-dirty #746 PREEMPT_{RT,(full)}
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> RIP: 0010:pv_native_safe_halt+0xf/0x20
> Code: 22 df e9 1f 08 e5 fe 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 85 96 15 00 fb f4 <e9> f7 07 e5 fe 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
> RSP: 0018:ffffffff95803e50 EFLAGS: 00000216
> RAX: ffff8e2d61534000 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000081f8a6c
> RBP: ffffffff9581d280 R08: 0000000000000000 R09: ffff8e2cf7d32301
> R10: ffff8e2be11ae5c8 R11: 0000000000000001 R12: 0000000000000000
> R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000147b0
> FS: 0000000000000000(0000) GS:ffff8e2d61534000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000055e77c3a5128 CR3: 000000010ff78003 CR4: 0000000000770ef0
> PKRU: 55555554
> Call Trace:
> <TASK>
> default_idle+0x9/0x20
> default_idle_call+0x30/0x100
> do_idle+0x20f/0x250
> ? do_idle+0xb/0x250
> cpu_startup_entry+0x29/0x30
> rest_init+0xde/0x100
> start_kernel+0x733/0xb20
> ? copy_bootdata+0x9/0xb0
> x86_64_start_reservations+0x18/0x30
> x86_64_start_kernel+0xba/0x110
> common_startup_64+0x13e/0x141
> </TASK>
>
> Is this in line with what you are seeing?
These are the backtrace for timer and the individual
epoll-stall threads:
[ 539.155042] task:ktimers/1 state:D stack:0 pid:31 tgid:31 ppid:2 task_flags:0x4208040 flags:0x00004000
[ 539.155047] Call Trace:
[ 539.155049] <TASK>
[ 539.155051] __schedule+0x401/0x15a0
[ 539.155055] ? srso_alias_return_thunk+0x5/0xfbef5
[ 539.155059] ? propagate_entity_cfs_rq+0x115/0x290
[ 539.155063] ? srso_alias_return_thunk+0x5/0xfbef5
[ 539.155067] ? srso_alias_return_thunk+0x5/0xfbef5
[ 539.155070] ? rt_mutex_setprio+0x1c2/0x480
[ 539.155075] schedule_rtlock+0x1e/0x40
[ 539.155078] rtlock_slowlock_locked+0x20e/0xc60
[ 539.155088] ? srso_alias_return_thunk+0x5/0xfbef5
[ 539.155093] rt_read_lock+0x8f/0x190
[ 539.155099] ep_poll_callback+0x37/0x2b0
[ 539.155105] __wake_up_common+0x78/0xa0
[ 539.155110] timerfd_tmrproc+0x43/0x60
[ 539.155114] ? __pfx_timerfd_tmrproc+0x10/0x10
[ 539.155116] __hrtimer_run_queues+0xfd/0x2e0
[ 539.155124] hrtimer_run_softirq+0x9d/0xf0
[ 539.155128] handle_softirqs.constprop.0+0xc1/0x2a0
[ 539.155134] ? __pfx_smpboot_thread_fn+0x10/0x10
[ 539.155139] run_ktimerd+0x3e/0x80
[ 539.155142] smpboot_thread_fn+0xf3/0x220
[ 539.155147] kthread+0xff/0x210
[ 539.155151] ? rt_spin_lock+0x3c/0xc0
[ 539.155155] ? __pfx_kthread+0x10/0x10
[ 539.155159] ret_from_fork+0x34/0x50
[ 539.155165] ? __pfx_kthread+0x10/0x10
[ 539.155168] ret_from_fork_asm+0x1a/0x30
[ 539.155176] </TASK>
[ 557.323846] task:epoll-stall state:D stack:0 pid:885 tgid:885 ppid:1 task_flags:0x400000 flags:0x00004002
[ 557.323848] Call Trace:
[ 557.323849] <TASK>
[ 557.323851] __schedule+0x401/0x15a0
[ 557.323853] ? rt_write_lock+0x108/0x260
[ 557.323858] schedule_rtlock+0x1e/0x40
[ 557.323860] rt_write_lock+0xaa/0x260
[ 557.323864] do_epoll_wait+0x21f/0x4a0
[ 557.323869] ? __pfx_ep_autoremove_wake_function+0x10/0x10
[ 557.323872] __x64_sys_epoll_wait+0x63/0x100
[ 557.323876] do_syscall_64+0x6f/0x120
[ 557.323879] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323881] ? ksys_read+0x6b/0xe0
[ 557.323883] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323885] ? syscall_exit_to_user_mode+0x51/0x1a0
[ 557.323887] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323889] ? do_syscall_64+0x7b/0x120
[ 557.323890] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323892] ? ep_send_events+0x26d/0x2b0
[ 557.323896] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323898] ? do_epoll_wait+0x17e/0x4a0
[ 557.323900] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323902] ? __rseq_handle_notify_resume+0xa7/0x500
[ 557.323905] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323907] ? aa_file_perm+0x123/0x4e0
[ 557.323911] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323912] ? get_nohz_timer_target+0x2a/0x180
[ 557.323914] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323916] ? _copy_to_iter+0xa3/0x630
[ 557.323920] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323922] ? timerqueue_add+0x6a/0xc0
[ 557.323924] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323926] ? hrtimer_start_range_ns+0x2e7/0x4a0
[ 557.323931] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323932] ? timerfd_read_iter+0x141/0x2b0
[ 557.323934] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323936] ? security_file_permission+0x123/0x140
[ 557.323940] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323942] ? vfs_read+0x264/0x340
[ 557.323946] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323948] ? ksys_read+0x6b/0xe0
[ 557.323950] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323951] ? syscall_exit_to_user_mode+0x51/0x1a0
[ 557.323953] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323955] ? do_syscall_64+0x7b/0x120
[ 557.323957] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323958] ? syscall_exit_to_user_mode+0x168/0x1a0
[ 557.323960] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323962] ? do_syscall_64+0x7b/0x120
[ 557.323963] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.323965] ? do_syscall_64+0x7b/0x120
[ 557.323967] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 557.323968] RIP: 0033:0x7f0371fd3dea
[ 557.323970] RSP: 002b:00007ffd1062cd68 EFLAGS: 00000246 ORIG_RAX: 00000000000000e8
[ 557.323971] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0371fd3dea
[ 557.323972] RDX: 0000000000000001 RSI: 00007ffd1062cda4 RDI: 0000000000000005
[ 557.323973] RBP: 00007ffd1062ce00 R08: 0000000000000000 R09: 000055ea1d8f12a0
[ 557.323974] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007ffd1062cf18
[ 557.323975] R13: 000055ea03253249 R14: 000055ea03255d80 R15: 00007f0372121040
[ 557.323979] </TASK>
[ 557.431402] task:epoll-stall-wri state:R running task stack:0 pid:887 tgid:887 ppid:1 task_flags:0x400100 flags:0x00000002
[ 557.431405] Call Trace:
[ 557.431406] <TASK>
[ 557.431408] __schedule+0x401/0x15a0
[ 557.431410] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431412] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431414] ? psi_group_change+0x212/0x460
[ 557.431417] ? pick_eevdf+0x71/0x180
[ 557.431419] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431421] ? update_curr+0x8d/0x240
[ 557.431425] preempt_schedule+0x41/0x60
[ 557.431427] preempt_schedule_thunk+0x16/0x30
[ 557.431431] try_to_wake_up+0x2f6/0x6e0
[ 557.431433] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431437] ep_autoremove_wake_function+0x12/0x40
[ 557.431439] __wake_up_common+0x78/0xa0
[ 557.431443] __wake_up_sync+0x34/0x50
[ 557.431445] ep_poll_callback+0x13e/0x2b0
[ 557.431448] ? aa_file_perm+0x123/0x4e0
[ 557.431451] __wake_up_common+0x78/0xa0
[ 557.431454] __wake_up_sync_key+0x38/0x50
[ 557.431456] anon_pipe_write+0x43b/0x6d0
[ 557.431461] fifo_pipe_write+0x13/0xe0
[ 557.431463] vfs_write+0x374/0x420
[ 557.431468] ksys_write+0xc9/0xe0
[ 557.431471] do_syscall_64+0x6f/0x120
[ 557.431473] ? current_time+0x30/0x130
[ 557.431476] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431479] ? vfs_write+0x1bd/0x420
[ 557.431481] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431483] ? vfs_write+0x1bd/0x420
[ 557.431487] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431489] ? ksys_write+0xc9/0xe0
[ 557.431491] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431492] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431494] ? syscall_exit_to_user_mode+0x51/0x1a0
[ 557.431496] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431498] ? do_syscall_64+0x7b/0x120
[ 557.431501] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431502] ? ksys_write+0xc9/0xe0
[ 557.431504] ? do_syscall_64+0x7b/0x120
[ 557.431506] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431508] ? syscall_exit_to_user_mode+0x51/0x1a0
[ 557.431510] ? srso_alias_return_thunk+0x5/0xfbef5
[ 557.431511] ? do_syscall_64+0x7b/0x120
[ 557.431513] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 557.431515] RIP: 0033:0x7f0ef191c887
[ 557.431516] RSP: 002b:00007ffc15f50948 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 557.431517] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0ef191c887
[ 557.431518] RDX: 0000000000000001 RSI: 0000561a16a0203d RDI: 0000000000000003
[ 557.431519] RBP: 00007ffc15f50970 R08: 0000000000000065 R09: 0000561a45f0e2a0
[ 557.431520] R10: 0000000000000077 R11: 0000000000000246 R12: 00007ffc15f50a88
[ 557.431521] R13: 0000561a16a011a9 R14: 0000561a16a03da8 R15: 00007f0ef1a7b040
[ 557.431525] </TASK>
>
>>
>>>
>>> Additionally, do you have RT throttling enabled in your setup? Can long
>>> running RT tasks starve fair tasks on your setup?
>>
>> RT throttling is enabled (default settings) but was not kicking in - why
>> should it in that scenario? The only RT thread, ktimers, ran into the
>> held lock and stopped.
>>
>> Jan
>>
>
--
Thanks and Regards,
Prateek
On 15.04.25 10:45, K Prateek Nayak wrote:
> (+ Sebastian)
>
> Hello Jan,
>
> On 4/15/2025 11:39 AM, Jan Kiszka wrote:
>>>> Attached the bits with which we succeeded, sometimes. Setup: Debian 12,
>>>> RT kernel, 2-4 cores VM, 1-5 instances of the test, 2 min - 2 h
>>>> patience. As we have to succeed with at least 3 race conditions in a
>>>> row, that is still not bad... But maybe someone has an idea how to
>>>> increase probabilities further.
>>>
>>> Looking at run.sh, there are only fair tasks with one of them being run
>>> with cfs bandwidth constraints. Are you saying something goes wrong on
>>> PREEMPT_RT as a result of using bandwidth control on fair tasks?
>>
>> Yes, exactly. Also our in-field workload that triggers (most likely)
>> this issue is not using RT tasks itself. Only kernel threads are RT here.
>>
>>>
>>> What exactly is the symptom you are observing? Does one of the assert()
>>> trip during the run? Do you see a stall logged on dmesg? Can you provide
>>> more information on what to expect in this 2min - 2hr window?
>>
>> I've just lost my traces from yesterday ("you have 0 minutes to find a
>> power adapter"), but I got nice RCU stall warnings in the VM, including
>> backtraces from the involved tasks (minus the read-lock holder IIRC).
>> Maybe Florian can drop one of his dumps.
>
> So I ran your reproducer on a 2vCPU VM running v6.15-rc1 PREEMPT_RT
> and I saw:
>
> rcu: INFO: rcu_preempt self-detected stall on CPU
> rcu: 0-...!: (15000 ticks this GP) idle=8a74/0/0x1 softirq=0/0
> fqs=0
> rcu: (t=15001 jiffies g=12713 q=24 ncpus=2)
> rcu: rcu_preempt kthread timer wakeup didn't happen for 15000
> jiffies! g12713 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> rcu: Possible timer handling issue on cpu=0 timer-softirq=17688
> rcu: rcu_preempt kthread starved for 15001 jiffies! g12713 f0x0
> RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
> rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is
> now expected behavior.
> rcu: RCU grace-period kthread stack dump:
> task:rcu_preempt state:I stack:0 pid:17 tgid:17
> ppid:2 task_flags:0x208040 flags:0x00004000
> Call Trace:
> <TASK>
> __schedule+0x401/0x15a0
> ? srso_alias_return_thunk+0x5/0xfbef5
> ? lock_timer_base+0x77/0xb0
> ? srso_alias_return_thunk+0x5/0xfbef5
> ? __pfx_rcu_gp_kthread+0x10/0x10
> schedule+0x27/0xd0
> schedule_timeout+0x76/0x100
> ? __pfx_process_timeout+0x10/0x10
> rcu_gp_fqs_loop+0x10a/0x4b0
> rcu_gp_kthread+0xd3/0x160
> kthread+0xff/0x210
> ? rt_spin_lock+0x3c/0xc0
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x34/0x50
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.15.0-rc1-test-
> dirty #746 PREEMPT_{RT,(full)}
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> RIP: 0010:pv_native_safe_halt+0xf/0x20
> Code: 22 df e9 1f 08 e5 fe 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90
> 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 85 96 15 00 fb f4 <e9> f7
> 07 e5 fe 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
> RSP: 0018:ffffffff95803e50 EFLAGS: 00000216
> RAX: ffff8e2d61534000 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000081f8a6c
> RBP: ffffffff9581d280 R08: 0000000000000000 R09: ffff8e2cf7d32301
> R10: ffff8e2be11ae5c8 R11: 0000000000000001 R12: 0000000000000000
> R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000147b0
> FS: 0000000000000000(0000) GS:ffff8e2d61534000(0000)
> knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000055e77c3a5128 CR3: 000000010ff78003 CR4: 0000000000770ef0
> PKRU: 55555554
> Call Trace:
> <TASK>
> default_idle+0x9/0x20
> default_idle_call+0x30/0x100
> do_idle+0x20f/0x250
> ? do_idle+0xb/0x250
> cpu_startup_entry+0x29/0x30
> rest_init+0xde/0x100
> start_kernel+0x733/0xb20
> ? copy_bootdata+0x9/0xb0
> x86_64_start_reservations+0x18/0x30
> x86_64_start_kernel+0xba/0x110
> common_startup_64+0x13e/0x141
> </TASK>
>
> Is this in line with what you are seeing?
>
Yes, and if you wait a bit longer for the second reporting round, you
should get more task backtraces as well.
Jan
--
Siemens AG, Foundational Technologies
Linux Expert Center
Hello Jan,
On 4/15/2025 3:51 PM, Jan Kiszka wrote:
>> Is this in line with what you are seeing?
>>
>
> Yes, and if you wait a bit longer for the second reporting round, you
> should get more task backtraces as well.
So looking at the backtrace [1], Aaron's patch should help with the
stalls you are seeing.
timerfd, which queues a hrtimer, also uses ep_poll_callback() to wake up
the epoll waiter; that callback is queued ahead of the bandwidth timer
and requires the read lock, but since the writer already tried to grab
the lock, readers are pushed onto the slowpath. If epoll-stall-writer is
now throttled, it needs ktimers to replenish its bandwidth, which cannot
happen without ktimers grabbing the read lock first.
# epoll-stall-writer

ep_poll()
{
        ...
        /*
         * Does not disable IRQ / preemption on PREEMPT_RT; sends future readers on
         * rwlock slowpath and they have to wait until epoll-stall-writer acquires
         * and drops the write lock.
         */
        write_lock_irq(&ep->lock);

        __set_current_state(TASK_INTERRUPTIBLE);

        /************** Preempted due to lack of bandwidth **************/
        ...
        eavail = ep_events_available(ep);
        if (!eavail)
                __add_wait_queue_exclusive(&ep->wq, &wait);

        /* Never reaches here waiting for bandwidth */
        write_unlock_irq(&ep->lock);
}

# ktimers

ep_poll_callback(...)
{
        ...
        /*
         * Does not disable interrupts on PREEMPT_RT; ktimers needs the
         * epoll-stall-writer to take the write lock and drop it to
         * proceed but epoll-stall-writer requires ktimers to run the
         * bandwidth timer to be runnable again. Deadlock!
         */
        read_lock_irqsave(&ep->lock, flags);
        ...
        /* wakeup within read side critical section */
        if (sync)
                wake_up_sync(&ep->wq);
        else
                wake_up(&ep->wq);
        ...
        read_unlock_irqrestore(&ep->lock, flags);
}
[1] https://lore.kernel.org/all/62304351-7fc0-48b6-883b-d346886dac8e@amd.com/
>
> Jan
>
--
Thanks and Regards,
Prateek
On 2025/4/9 20:07, Aaron Lu wrote:
> This is a continuous work based on Valentin Schneider's posting here:
> Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
> https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
>
> Valentin has described the problem very well in the above link. We also
> have task hung problem from time to time in our environment due to cfs quota.
> It is mostly visible with rwsem: when a reader is throttled, writer comes in
> and has to wait, the writer also makes all subsequent readers wait,
> causing problems of priority inversion or even whole system hung.
>
> To improve this situation, change the throttle model to task based, i.e.
> when a cfs_rq is throttled, mark its throttled status but do not
> remove it from cpu's rq. Instead, for tasks that belong to this cfs_rq,
> when they get picked, add a task work to them so that when they return
> to user, they can be dequeued. In this way, tasks throttled will not
> hold any kernel resources. When cfs_rq gets unthrottled, enqueue back
> those throttled tasks.
>
> There are consequences because of this new throttle model, e.g. for a
> cfs_rq that has 3 tasks attached, when 2 tasks are throttled on their
> return2user path, one task still running in kernel mode, this cfs_rq is
> in a partial throttled state:
> - Should its pelt clock be frozen?
> - Should this state be accounted into throttled_time?
>
> For pelt clock, I chose to keep the current behavior to freeze it on
> cfs_rq's throttle time. The assumption is that tasks running in kernel
> mode should not last too long, freezing the cfs_rq's pelt clock can keep
> its load and its corresponding sched_entity's weight. Hopefully, this can
> result in a stable situation for the remaining running tasks to quickly
> finish their jobs in kernel mode.
Seems reasonable to me, although I'm wondering whether it is possible or
desirable to implement a per-task PELT freeze?
>
> For throttle time accounting, I can see several possibilities:
> - Similar to current behavior: starts accounting when cfs_rq gets
> throttled(if cfs_rq->nr_queued > 0) and stops accounting when cfs_rq
> gets unthrottled. This has one drawback, e.g. if this cfs_rq has one
> task when it gets throttled and eventually, that task doesn't return
> to user but blocks, then this cfs_rq has no tasks on throttled list
> but time is accounted as throttled; Patch2 and patch3 implements this
> accounting(simple, fewer code change).
> - Starts accounting when the throttled cfs_rq has at least one task on
> its throttled list; stops accounting when it's unthrottled. This kind
> of over accounts throttled time because partial throttle state is
> accounted.
> - Starts accounting when the throttled cfs_rq has no tasks left and its
> throttled list is not empty; stops accounting when this cfs_rq is
> unthrottled; This kind of under accounts throttled time because partial
> throttle state is not accounted. Patch7 implements this accounting.
> I do not have a strong feeling which accounting is the best, it's open
> for discussion.
I personally prefer option 2, which gives a more practical throttled time,
so we can know for how long some tasks were in fact throttled.
Thanks!
>
> There is also the concern of increased duration of (un)throttle operations
> in v1. I've done some tests and with a 2000 cgroups/20K runnable tasks
> setup on a 2sockets/384cpus AMD server, the longest duration of
> distribute_cfs_runtime() is in the 2ms-4ms range. For details, please see:
> https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
> For throttle path, with Chengming's suggestion to move "task work setup"
> from throttle time to pick time, it's not an issue anymore.
>
> Patches:
> Patch1 is preparation work;
>
> Patch2-3 provide the main functionality.
> Patch2 deals with throttle path: when a cfs_rq is to be throttled, mark
> throttled status for this cfs_rq and when tasks in throttled hierarchy
> gets picked, add a task work to them so that when those tasks return to
> user space, the task work can throttle it by dequeuing the task and
> remember this by adding the task to its cfs_rq's limbo list;
> Patch3 deals with unthrottle path: when a cfs_rq is to be unthrottled,
> enqueue back those tasks in limbo list;
>
> Patch4 deals with the dequeue path when task changes group, sched class
> etc. Task that is throttled is dequeued in fair, but task->on_rq is
> still set so when it changes task group or sched class or has affinity
> setting change, core will firstly dequeue it. But since this task is
> already dequeued in fair class, this patch handle this situation.
>
> Patch5-6 are clean ups. Some code are obsolete after switching to task
> based throttle mechanism.
>
> Patch7 implements an alternative accounting mechanism for task based
> throttle.
>
> Changes since v1:
> - Move "add task work" from throttle time to pick time, suggested by
> Chengming Zhou;
> - Use scope_gard() and cond_resched_tasks_rcu_qs() in
> throttle_cfs_rq_work(), suggested by K Prateek Nayak;
> - Remove now obsolete throttled_lb_pair(), suggested by K Prateek Nayak;
> - Fix cfs_rq->runtime_remaining condition check in unthrottle_cfs_rq(),
> suggested by K Prateek Nayak;
> - Fix h_nr_runnable accounting for delayed dequeue case when task based
> throttle is in use;
> - Implemented an alternative way of throttle time accounting for
> discussion purpose;
> - Make !CONFIG_CFS_BANDWIDTH build.
> I hope I didn't omit any feedbacks I've received, but feel free to let me
> know if I did.
>
> As in v1, all change logs are written by me and if they read bad, it's
> my fault.
>
> Comments are welcome.
>
> Base commit: tip/sched/core, commit 6432e163ba1b("sched/isolation: Make
> use of more than one housekeeping cpu").
>
> Aaron Lu (4):
> sched/fair: Take care of group/affinity/sched_class change for
> throttled task
> sched/fair: get rid of throttled_lb_pair()
> sched/fair: fix h_nr_runnable accounting with per-task throttle
> sched/fair: alternative way of accounting throttle time
>
> Valentin Schneider (3):
> sched/fair: Add related data structure for task based throttle
> sched/fair: Handle throttle path for task based throttle
> sched/fair: Handle unthrottle path for task based throttle
>
> include/linux/sched.h | 4 +
> kernel/sched/core.c | 3 +
> kernel/sched/fair.c | 449 ++++++++++++++++++++++--------------------
> kernel/sched/sched.h | 7 +
> 4 files changed, 248 insertions(+), 215 deletions(-)
>
On Mon, Apr 14, 2025 at 11:05:30AM +0800, Chengming Zhou wrote:
> On 2025/4/9 20:07, Aaron Lu wrote:
> > This is a continuous work based on Valentin Schneider's posting here:
> > Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
> > https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
> >
> > Valentin has described the problem very well in the above link. We also
> > have task hung problem from time to time in our environment due to cfs quota.
> > It is mostly visible with rwsem: when a reader is throttled, writer comes in
> > and has to wait, the writer also makes all subsequent readers wait,
> > causing problems of priority inversion or even whole system hung.
> >
> > To improve this situation, change the throttle model to task based, i.e.
> > when a cfs_rq is throttled, mark its throttled status but do not
> > remove it from cpu's rq. Instead, for tasks that belong to this cfs_rq,
> > when they get picked, add a task work to them so that when they return
> > to user, they can be dequeued. In this way, tasks throttled will not
> > hold any kernel resources. When cfs_rq gets unthrottled, enqueue back
> > those throttled tasks.
> >
> > There are consequences because of this new throttle model, e.g. for a
> > cfs_rq that has 3 tasks attached, when 2 tasks are throttled on their
> > return2user path, one task still running in kernel mode, this cfs_rq is
> > in a partial throttled state:
> > - Should its pelt clock be frozen?
> > - Should this state be accounted into throttled_time?
> >
> > For pelt clock, I chose to keep the current behavior to freeze it on
> > cfs_rq's throttle time. The assumption is that tasks running in kernel
> > mode should not last too long, freezing the cfs_rq's pelt clock can keep
> > its load and its corresponding sched_entity's weight. Hopefully, this can
> > result in a stable situation for the remaining running tasks to quickly
> > finish their jobs in kernel mode.
>
> Seems reasonable to me, although I'm wondering is it possible or desirable
> to implement per-task PELT freeze?
Interesting idea.
One thing I'm wondering: would a per-task PELT freeze cause a task's and
its cfs_rq's pelt clocks to get out of sync? If so, I feel it would create
some headaches, but I haven't thought this through yet.
> >
> > For throttle time accounting, I can see several possibilities:
> > - Similar to current behavior: starts accounting when cfs_rq gets
> > throttled(if cfs_rq->nr_queued > 0) and stops accounting when cfs_rq
> > gets unthrottled. This has one drawback, e.g. if this cfs_rq has one
> > task when it gets throttled and eventually, that task doesn't return
> > to user but blocks, then this cfs_rq has no tasks on throttled list
> > but time is accounted as throttled; Patch2 and patch3 implements this
> > accounting(simple, fewer code change).
> > - Starts accounting when the throttled cfs_rq has at least one task on
> > its throttled list; stops accounting when it's unthrottled. This kind
> > of over accounts throttled time because partial throttle state is
> > accounted.
> > - Starts accounting when the throttled cfs_rq has no tasks left and its
> > throttled list is not empty; stops accounting when this cfs_rq is
> > unthrottled; This kind of under accounts throttled time because partial
> > throttle state is not accounted. Patch7 implements this accounting.
> > I do not have a strong feeling which accounting is the best, it's open
> > for discussion.
>
> I personally prefer option 2, which has a more practical throttled time,
> so we can know how long there are some tasks throttled in fact.
>
> Thanks!
Thanks for the input.
Now that I think about this more, I feel option 2 is essentially a better
version of option 1 because it doesn't have the drawback of option 1
I mentioned above, so option 1 should probably just be ruled out.
Then there are only 2 options to consider and their difference is
basically whether to treat partial throttle state as throttled or not.
Thanks,
Aaron
> >
> > There is also the concern of increased duration of (un)throttle operations
> > in v1. I've done some tests and with a 2000 cgroups/20K runnable tasks
> > setup on a 2sockets/384cpus AMD server, the longest duration of
> > distribute_cfs_runtime() is in the 2ms-4ms range. For details, please see:
> > https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
> > For throttle path, with Chengming's suggestion to move "task work setup"
> > from throttle time to pick time, it's not an issue anymore.
> >
> > Patches:
> > Patch1 is preparation work;
> >
> > Patch2-3 provide the main functionality.
> > Patch2 deals with throttle path: when a cfs_rq is to be throttled, mark
> > throttled status for this cfs_rq and when tasks in throttled hierarchy
> > gets picked, add a task work to them so that when those tasks return to
> > user space, the task work can throttle it by dequeuing the task and
> > remember this by adding the task to its cfs_rq's limbo list;
> > Patch3 deals with unthrottle path: when a cfs_rq is to be unthrottled,
> > enqueue back those tasks in limbo list;
> >
> > Patch4 deals with the dequeue path when task changes group, sched class
> > etc. Task that is throttled is dequeued in fair, but task->on_rq is
> > still set so when it changes task group or sched class or has affinity
> > setting change, core will firstly dequeue it. But since this task is
> > already dequeued in fair class, this patch handle this situation.
> >
> > Patch5-6 are clean ups. Some code are obsolete after switching to task
> > based throttle mechanism.
> >
> > Patch7 implements an alternative accounting mechanism for task based
> > throttle.
> >
> > Changes since v1:
> > - Move "add task work" from throttle time to pick time, suggested by
> > Chengming Zhou;
> > - Use scope_gard() and cond_resched_tasks_rcu_qs() in
> > throttle_cfs_rq_work(), suggested by K Prateek Nayak;
> > - Remove now obsolete throttled_lb_pair(), suggested by K Prateek Nayak;
> > - Fix cfs_rq->runtime_remaining condition check in unthrottle_cfs_rq(),
> > suggested by K Prateek Nayak;
> > - Fix h_nr_runnable accounting for delayed dequeue case when task based
> > throttle is in use;
> > - Implemented an alternative way of throttle time accounting for
> > discussion purpose;
> > - Make !CONFIG_CFS_BANDWIDTH build.
> > I hope I didn't omit any feedbacks I've received, but feel free to let me
> > know if I did.
> >
> > As in v1, all change logs are written by me and if they read bad, it's
> > my fault.
> >
> > Comments are welcome.
> >
> > Base commit: tip/sched/core, commit 6432e163ba1b("sched/isolation: Make
> > use of more than one housekeeping cpu").
> >
> > Aaron Lu (4):
> > sched/fair: Take care of group/affinity/sched_class change for
> > throttled task
> > sched/fair: get rid of throttled_lb_pair()
> > sched/fair: fix h_nr_runnable accounting with per-task throttle
> > sched/fair: alternative way of accounting throttle time
> >
> > Valentin Schneider (3):
> > sched/fair: Add related data structure for task based throttle
> > sched/fair: Handle throttle path for task based throttle
> > sched/fair: Handle unthrottle path for task based throttle
> >
> > include/linux/sched.h | 4 +
> > kernel/sched/core.c | 3 +
> > kernel/sched/fair.c | 449 ++++++++++++++++++++++--------------------
> > kernel/sched/sched.h | 7 +
> > 4 files changed, 248 insertions(+), 215 deletions(-)
> >
Hello Aaron,

On 4/9/2025 5:37 PM, Aaron Lu wrote:
> This is a continuous work based on Valentin Schneider's posting here:
> Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
> https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
>
> Valentin has described the problem very well in the above link. We also
> have task hung problem from time to time in our environment due to cfs quota.
> It is mostly visible with rwsem: when a reader is throttled, writer comes in
> and has to wait, the writer also makes all subsequent readers wait,
> causing problems of priority inversion or even whole system hung.
>
> To improve this situation, change the throttle model to task based, i.e.
> when a cfs_rq is throttled, mark its throttled status but do not
> remove it from cpu's rq. Instead, for tasks that belong to this cfs_rq,
> when they get picked, add a task work to them so that when they return
> to user, they can be dequeued. In this way, tasks throttled will not
> hold any kernel resources. When cfs_rq gets unthrottled, enqueue back
> those throttled tasks.

I tried to reproduce the scenario that Valentin describes in the
parallel thread [1] and I haven't run into a stall yet with this
series applied on top of v6.15-rc1 [2].

So for Patch 1-6, feel free to add:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

I'm slowly crawling through the series and haven't gotten to the Patch 7
yet but so far I haven't seen anything unexpected in my initial testing
and it seems to solve possible circular dependency on PREEMPT_RT with
bandwidth replenishment (Sebastian has some doubts if my reproducer is
correct but that discussion is for the other thread)

Thank you for working on this and a big thanks to Valentin for the solid
groundwork.

[1] https://lore.kernel.org/linux-rt-users/f2e2c74c-b15d-4185-a6ea-4a19eee02417@amd.com/
[2] https://lore.kernel.org/linux-rt-users/534df953-3cfb-4b3d-8953-5ed9ef24eabc@amd.com/

>
> There are consequences because of this new throttle model, e.g. for a
> cfs_rq that has 3 tasks attached, when 2 tasks are throttled on their
> return2user path, one task still running in kernel mode, this cfs_rq is
> in a partial throttled state:
> - Should its pelt clock be frozen?
> - Should this state be accounted into throttled_time?
>
> For pelt clock, I chose to keep the current behavior to freeze it on
> cfs_rq's throttle time. The assumption is that tasks running in kernel
> mode should not last too long, freezing the cfs_rq's pelt clock can keep
> its load and its corresponding sched_entity's weight. Hopefully, this can
> result in a stable situation for the remaining running tasks to quickly
> finish their jobs in kernel mode.
>
> For throttle time accounting, I can see several possibilities:
> - Similar to current behavior: starts accounting when cfs_rq gets
> throttled(if cfs_rq->nr_queued > 0) and stops accounting when cfs_rq
> gets unthrottled. This has one drawback, e.g. if this cfs_rq has one
> task when it gets throttled and eventually, that task doesn't return
> to user but blocks, then this cfs_rq has no tasks on throttled list
> but time is accounted as throttled; Patch2 and patch3 implements this
> accounting(simple, fewer code change).
> - Starts accounting when the throttled cfs_rq has at least one task on
> its throttled list; stops accounting when it's unthrottled. This kind
> of over accounts throttled time because partial throttle state is
> accounted.
> - Starts accounting when the throttled cfs_rq has no tasks left and its
> throttled list is not empty; stops accounting when this cfs_rq is
> unthrottled; This kind of under accounts throttled time because partial
> throttle state is not accounted. Patch7 implements this accounting.
> I do not have a strong feeling which accounting is the best, it's open
> for discussion.
>
> There is also the concern of increased duration of (un)throttle operations
> in v1. I've done some tests and with a 2000 cgroups/20K runnable tasks
> setup on a 2sockets/384cpus AMD server, the longest duration of
> distribute_cfs_runtime() is in the 2ms-4ms range. For details, please see:
> https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
> For throttle path, with Chengming's suggestion to move "task work setup"
> from throttle time to pick time, it's not an issue anymore.
>
[..snip..]

--
Thanks and Regards,
Prateek
Hi Prateek,

On Mon, Apr 14, 2025 at 10:04:02PM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> On 4/9/2025 5:37 PM, Aaron Lu wrote:
> > This is a continuous work based on Valentin Schneider's posting here:
> > Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
> > https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
> >
> > Valentin has described the problem very well in the above link. We also
> > have task hung problem from time to time in our environment due to cfs quota.
> > It is mostly visible with rwsem: when a reader is throttled, writer comes in
> > and has to wait, the writer also makes all subsequent readers wait,
> > causing problems of priority inversion or even whole system hung.
> >
> > To improve this situation, change the throttle model to task based, i.e.
> > when a cfs_rq is throttled, mark its throttled status but do not
> > remove it from cpu's rq. Instead, for tasks that belong to this cfs_rq,
> > when they get picked, add a task work to them so that when they return
> > to user, they can be dequeued. In this way, tasks throttled will not
> > hold any kernel resources. When cfs_rq gets unthrottled, enqueue back
> > those throttled tasks.
>
> I tried to reproduce the scenario that Valentin describes in the
> parallel thread [1] and I haven't run into a stall yet with this
> series applied on top of v6.15-rc1 [2].

Great to hear this.

> So for Patch 1-6, feel free to add:
>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

Thanks!

> I'm slowly crawling through the series and haven't gotten to the Patch 7
> yet but so far I haven't seen anything unexpected in my initial testing
> and it seems to solve possible circular dependency on PREEMPT_RT with
> bandwidth replenishment (Sebastian has some doubts if my reproducer is
> correct but that discussion is for the other thread)
>
> Thank you for working on this and a big thanks to Valentin for the solid
> groundwork.

Thank you a ton for your review and test.

> [1] https://lore.kernel.org/linux-rt-users/f2e2c74c-b15d-4185-a6ea-4a19eee02417@amd.com/
> [2] https://lore.kernel.org/linux-rt-users/534df953-3cfb-4b3d-8953-5ed9ef24eabc@amd.com/

Best regards,
Aaron