Hello everyone,
tl;dr
This is a spiritual successor to Valentin's series to defer CFS throttle
to user entry [1]; however, this does things differently: instead of
moving to per-task throttling, it retains the throttling at the cfs_rq
level while tracking the tasks queued on the cfs_rq that were preempted
in kernel mode.
On a throttled hierarchy, only the kernel mode preempted entities are
considered runnable, and pick_task_fair() will select only among these
entities in EEVDF fashion using a min-heap approach (Patches 6-8).
This is an early prototype with a few known issues (please see the
"Broken bits and disclaimer" section), but I'm putting it out to get
some early feedback.
This series is based on
git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
at commit d34e798094ca ("sched/fair: Refactor can_migrate_task() to
elimate looping")
With that quick tl;dr sorted, let us jump right into it
Problem statement
=================
Following are Valentin's words taken from [1] with a few changes to the
reference commits now that PREEMPT_RT has landed upstream:
CFS tasks can end up throttled while holding locks that other,
non-throttled tasks are blocking on.
For !PREEMPT_RT, this can be a source of latency due to the throttling
causing a resource acquisition denial.
For PREEMPT_RT, this is worse and can lead to a deadlock:
o A CFS task p0 gets throttled while holding read_lock(&lock)
o A task p1 blocks on write_lock(&lock), making further readers enter
the slowpath
o A ktimers or ksoftirqd task blocks on read_lock(&lock)
If the cfs_bandwidth.period_timer to replenish p0's runtime is enqueued
on the same CPU as one where ktimers/ksoftirqd is blocked on
read_lock(&lock), this creates a circular dependency.
This has been observed to happen with:
o fs/eventpoll.c::ep->lock
o net/netlink/af_netlink.c::nl_table_lock (after hand-fixing the above)
but can trigger with any rwlock that can be acquired in both process and
softirq contexts.
For PREEMPT_RT, the upstream kernel added commit 49a17639508c ("softirq:
Use a dedicated thread for timer wakeups on PREEMPT_RT.") which helped
this scenario for non-rwlock locks by ensuring the throttled task would
get PI'd to FIFO1 (ktimers' default priority). Unfortunately, rwlocks
cannot sanely do PI as they allow multiple readers.
Throttle deferral was discussed at OSPM'24 [2] and at LPC'24 [3], with
recordings for both available on YouTube.
Proposed approach
=================
This approach builds on Ben Segall's proposal in [4], which marked the
task in schedule() when exiting to usermode by setting an
"in_return_to_user" flag, except this prototype takes it a step further
and marks a "kernel critical section" within the syscall boundary using
a per-task "kernel_cs_count".
The rationale behind this approach is that a task can only hold
kernel resources when running in kernel mode in preemptible context. In
this PoC, the entire syscall boundary is marked as a kernel critical
section, but in the future the API can be used to mark fine-grained
boundaries such as a down_read()/up_read() or down_write()/up_write()
pair.
Within a kernel critical section, throttling events are deferred until
the task's "kernel_cs_count" hits 0. Currently this count is a signed
integer so that cases where it turns negative as a result of oversights
on my part can be caught, but it could be changed to a
preempt-count-like mechanism to request a resched.
cfs_rq throttled picked again
v v
----|*********| (preempted by tick / wakeup) |***********| (full throttle)
^ ^
critical section cfs_rq is throttled partially critical section
entry since the task is still exit
runnable as it was preempted in
kernel critical section
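As a rough sketch of the idea (not the series' exact code: "kernel_cs_count"
and the sched_notify_critical_section*() naming come from this letter, but
the helper bodies below are assumptions), the GENERIC_ENTRY syscall path
simply bumps and drops the per-task count around the syscall body:

/*
 * Illustrative sketch only. kernel_cs_count is the new field added to
 * task_struct by this series; the helper shapes are assumed.
 */
static __always_inline void sched_notify_critical_section_entry(void)
{
        /* The whole syscall is one critical section in this PoC. */
        current->kernel_cs_count++;
}

static __always_inline void sched_notify_critical_section_exit(void)
{
        current->kernel_cs_count--;
        WARN_ON_ONCE(current->kernel_cs_count < 0);

        /*
         * Once the count hits 0 the task is throttlable again; if it was
         * picked on a (partially) throttled hierarchy, the series requests
         * a resched at this point (Patch 15) so the deferred throttle can
         * finally take effect.
         */
}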
The EEVDF infrastructure is extended to track the avg_vruntime and the
avg_load of only those entities preempted in kernel mode. When a cfs_rq
is throttled, it uses these metrics to select among the kernel mode
preempted tasks and run them until they exit to user mode.
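A minimal sketch of how these aggregates could be maintained, modelled on
the existing avg_vruntime_add()/avg_vruntime_sub() helpers in
kernel/sched/fair.c (the avg_kcs_* field names are from this letter; the
helper form is an assumption):

/*
 * Sketch: weighted key sum and total load of only the entities that are
 * preempted in kernel mode. Called when such an entity joins/leaves the
 * rbtree (or when its kernel mode status flips).
 */
static void avg_kcs_vruntime_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        unsigned long weight = scale_load_down(se->load.weight);
        s64 key = entity_key(cfs_rq, se);

        cfs_rq->avg_kcs_vruntime += key * weight;
        cfs_rq->avg_kcs_load += weight;
}

static void avg_kcs_vruntime_sub(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        unsigned long weight = scale_load_down(se->load.weight);
        s64 key = entity_key(cfs_rq, se);

        cfs_rq->avg_kcs_vruntime -= key * weight;
        cfs_rq->avg_kcs_load -= weight;
}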
pick_eevdf() is made aware that it is operating on a throttled hierarchy
so that it only selects among the tasks that were preempted in kernel
mode (and the sched entities of the cfs_rqs that lead to them). When a
throttled entity's "kernel_cs_count" hits 0, the entire hierarchy is
frozen, but it remains accessible for picking until that point.
root
/ \
A B * (throttled)
... / | \
0 1* 2*
(*) Preempted in kernel mode
o avg_kcs_vruntime = entity_key(1) * load(1) + entity_key(2) * load(2)
o avg_kcs_load = load(1) + load(2)
o throttled_vruntime_eligible:
entity preempted in kernel mode &&
entity_key(<>) * avg_kcs_load <= avg_kcs_vruntime
o rbtree is augmented with a "min_kcs_vruntime" field in the sched entity
that propagates the smallest vruntime of all the entities in the
subtree that are preempted in kernel mode. If an entity was executing
in usermode when preempted, its contribution is LLONG_MAX.
This is used to construct a min-heap to search through the
entities. Consider the rbtree of B to look like:
1*
/ \
2* 0
min_kcs_vruntime = (se_in_kernel()) ? se->vruntime : LLONG_MAX;
min_kcs_vruntime = min(se->min_kcs_vruntime,
__node_2_se(rb_left)->min_kcs_vruntime,
__node_2_se(rb_right)->min_kcs_vruntime);
pick_eevdf() uses the min_kcs_vruntime on the virtual deadline sorted
tree to first check the left subtree for eligibility, then the node
itself, and then the right subtree.
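Putting the pieces above together, the eligibility rule translates roughly
into the following, mirroring the existing vruntime_eligible(); this is a
sketch under assumptions (se_in_kernel() and the avg_kcs_* fields are the
series' names; the special handling of the current entity, which the series
only consults in schedule(), is omitted here):

static bool throttled_vruntime_eligible(struct cfs_rq *cfs_rq,
                                        struct sched_entity *se)
{
        /*
         * Only entities that are (or lead to tasks) preempted in kernel
         * mode are candidates on a throttled hierarchy.
         */
        if (!se_in_kernel(se))
                return false;

        /*
         * Same zero-lag average trick as vruntime_eligible(), restricted
         * to the kernel mode preempted entities:
         *      key(se) * avg_kcs_load <= avg_kcs_vruntime
         */
        return entity_key(cfs_rq, se) * (s64)cfs_rq->avg_kcs_load <=
               cfs_rq->avg_kcs_vruntime;
}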
With proactive tracking, throttle deferral can retain throttling at per
cfs_rq granularity instead of moving to a per-task model.
On the throttling side, a third throttle state is introduced,
CFS_THROTTLED_PARTIAL, which indicates that the cfs_rq has run out of its
bandwidth but contains runnable (kernel mode preempted) entities. A
partially throttled cfs_rq is added to the throttled list so it can be
unthrottled in a timely manner during distribution, making all the tasks
queued on it runnable again. The throttle status can be promoted or
demoted depending on whether a kernel mode preempted task exits to user
mode, blocks, or wakes up on the hierarchy.
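As a rough mental model of the three states (a sketch only; the actual enum
from Patch 9 may differ, and cfs_rq_has_kernel_cs_entities() is a
hypothetical helper), with the numeric values matching the 0/1/2 seen in
the trace output in the Testing section:

enum cfs_throttle_state {
        CFS_UNTHROTTLED,        /* 0: bandwidth still available */
        CFS_THROTTLED_PARTIAL,  /* 1: out of bandwidth, kernel mode
                                 *    preempted entities still runnable */
        CFS_THROTTLED,          /* 2: out of bandwidth, nothing runnable */
};

/* Assumed shape of the helper introduced in Patch 20. */
static int determine_throttle_state(struct cfs_rq *cfs_rq)
{
        /*
         * Keep the hierarchy pickable while any queued entity is still
         * inside a kernel critical section (the current task is checked
         * separately in schedule()); freeze it completely otherwise.
         */
        if (cfs_rq_has_kernel_cs_entities(cfs_rq))      /* hypothetical */
                return CFS_THROTTLED_PARTIAL;

        return CFS_THROTTLED;
}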
The trickiest part of the series is the handling of the current task,
which is kept out of kernel mode status tracking since a running task
can make multiple syscalls and propagating those signals up the cgroup
hierarchy can be expensive. Instead, the status of the current task is
only looked at during schedule() to make throttling decisions. This
leads to some interesting considerations in a few paths (see
curr_h_is_throttled() in Patch 8 for the EEVDF handling).
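In other words, something along these lines (purely illustrative; the
helper name is made up), with the sampling happening only from within
schedule():

/* Hypothetical: sample the running task's state at schedule() time. */
static inline bool curr_in_kernel_cs(struct rq *rq)
{
        return rq->curr->kernel_cs_count > 0;
}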
Testing
=======
The series was tested for correctness using a single hackbench instance
in a bandwidth controlled cgroup to observe the following set of events:
o Throttle into partial state followed by a full throttle
hackbench-1602 [044] dN.2. 55.684108: throttle_cfs_rq: cfs_rq throttled: (0) -> (1)
hackbench-1602 [044] dN.2. 55.684111: pick_task_fair: Task picked on throttled hierarchy: hackbench(1602)
hackbench-1602 [044] d.... 55.684117: sched_notify_critical_section_exit: Pick on throttled requesting resched
hackbench-1602 [044] dN.2. 55.684119: throttle_cfs_rq: cfs_rq throttled: (1) -> (2)
...
<idle>-0 [044] dNh2. 55.689677: unthrottle_cfs_rq: cfs_rq unthrottled: (2) -> (0)
o Full throttle demoted to partial and then reupgraded
<idle>-0 [006] dN.2. 55.592145: unthrottle_cfs_rq: cfs_rq unthrottled: (2) -> (1)
<idle>-0 [006] dN.2. 55.592147: pick_task_fair: Task picked on throttled hierarchy: hackbench(1584)
...
hackbench-1584 [006] d.... 55.592154: sched_notify_critical_section_exit: Pick on throttled requesting resched
...
hackbench-1584 [006] dN.2. 55.592157: throttle_cfs_rq: cfs_rq throttled: (1) -> (2)
[ Note: (1) corresponds to CFS_THROTTLE_PARTIAL (throttled but with
kernel mode preempted entities that are still runnable) and (2)
corresponds to CFS_THROTTLED (full throttle with no runnable task) ]
The stability testing was done by running the following script:
#!/bin/bash
mkdir /sys/fs/cgroup/CG0
echo "+cpu" > /sys/fs/cgroup/CG0/cgroup.subtree_control
echo "+cpuset" > /sys/fs/cgroup/CG0/cgroup.subtree_control
mkdir /sys/fs/cgroup/CG0/CG00
mkdir /sys/fs/cgroup/CG0/CG01
echo "+cpu" > /sys/fs/cgroup/CG0/CG00/cgroup.subtree_control
echo "+cpu" > /sys/fs/cgroup/CG0/CG01/cgroup.subtree_control
echo "+cpuset" > /sys/fs/cgroup/CG0/CG01/cgroup.subtree_control
echo "+cpuset" > /sys/fs/cgroup/CG0/CG00/cgroup.subtree_control
mkdir /sys/fs/cgroup/CG0/CG00/CG000
mkdir /sys/fs/cgroup/CG0/CG00/CG001
mkdir /sys/fs/cgroup/CG0/CG01/CG000
mkdir /sys/fs/cgroup/CG0/CG01/CG001
echo "+cpu" > /sys/fs/cgroup/CG0/CG00/CG001/cgroup.subtree_control
echo "+cpu" > /sys/fs/cgroup/CG0/CG00/CG000/cgroup.subtree_control
echo "+cpu" > /sys/fs/cgroup/CG0/CG01/CG000/cgroup.subtree_control
echo "+cpu" > /sys/fs/cgroup/CG0/CG01/CG001/cgroup.subtree_control
echo "+cpuset" > /sys/fs/cgroup/CG0/CG01/CG000/cgroup.subtree_control
echo "+cpuset" > /sys/fs/cgroup/CG0/CG01/CG001/cgroup.subtree_control
echo "+cpuset" > /sys/fs/cgroup/CG0/CG00/CG000/cgroup.subtree_control
echo "+cpuset" > /sys/fs/cgroup/CG0/CG00/CG001/cgroup.subtree_control
echo "3200000 100000" > /sys/fs/cgroup/CG0/cpu.max
echo "1600000 100000" > /sys/fs/cgroup/CG0/CG00/cpu.max
echo "1600000 100000" > /sys/fs/cgroup/CG0/CG01/cpu.max
echo "800000 100000" > /sys/fs/cgroup/CG0/CG00/CG000/cpu.max
echo "800000 100000" > /sys/fs/cgroup/CG0/CG00/CG001/cpu.max
echo "800000 100000" > /sys/fs/cgroup/CG0/CG01/CG000/cpu.max
echo "800000 100000" > /sys/fs/cgroup/CG0/CG01/CG001/cpu.max
sudo su -c "bash -c \"echo \$\$ > /sys/fs/cgroup/CG0/CG00/CG000/cgroup.procs; hackbench -p -l 5000000 -g $1 -T& wait;\""&
sudo su -c "bash -c \"echo \$\$ > /sys/fs/cgroup/CG0/CG00/CG001/cgroup.procs; hackbench -p -l 5000000 -g $1 -T& wait;\""&
sudo su -c "bash -c \"echo \$\$ > /sys/fs/cgroup/CG0/CG01/CG000/cgroup.procs; hackbench -p -l 5000000 -g $1 -T& wait;\""&
sudo su -c "bash -c \"echo \$\$ > /sys/fs/cgroup/CG0/CG01/CG001/cgroup.procs; hackbench -p -l 5000000 -g $1 -T& wait;\""&
time wait;
---
The above script was stress tested with 1, 2, 4, and 8 groups in a 64 vCPU
VM, and with up to 64 groups on a 3rd Generation EPYC system (2 x
64C/128T) without experiencing any obvious issues.
Since I had run into various issues with this prototype initially, I've
retained a debug patch towards the end that dumps the rbtree state when
pick_eevdf() fails to select a task. As it is a debug patch, all the
checkpatch warnings and errors have been ignored; I apologize in advance
for any inconvenience there.
Broken bits and disclaimer
==========================
In ascending order of priority:
- A few of the added SCHED_WARN_ON() checks assume that only the syscall
boundary is a critical section. These can go if this infrastructure is
used for more fine-grained deferral.
- !CONFIG_CFS_BANDWIDTH and !CONFIG_SCHED_AUTOGROUP have only been build
tested. Most (almost all) of the testing was done with
CONFIG_CFS_BANDWIDTH enabled on an x86 system and a VM.
- The current PoC only supports architectures that select GENERIC_ENTRY.
Other architectures need to use the sched_notify_critical_section*()
API to mark the kernel critical section boundaries.
- Some abstractions are not very clean. For example, the generic layer
uses se_in_kernel(), often passing &p->se as the argument. Instead, the
task_struct could be passed directly to a wrapper around se_in_kernel().
- There is currently no way to opt out of throttle deferral; however, it
is possible to introduce a sysctl hook to enable / disable the feature
on demand by only allowing propagation of "kernel_cs_count" down the
cgroup hierarchy when the feature is enabled, plus a static key to guard
the key decision points that can lead to a partial throttle (see the
sketch after this list). A simpler option is a boot time switch.
- On a partially throttled hierarchy, only kernel mode preempted tasks
and the entities that lead to them can make progress. This can
theoretically lead them to accumulate an unbounded amount of negative
lag, which can result in long wait times for these entities when the
hierarchy is unthrottled.
- PELT requires a closer look since the number of runnable entities on a
throttled hierarchy is no longer the same as "h_nr_runnable". All
references to "cfs_rq->h_nr_runnable" and "se->runnable_sum" need to be
audited and those paths addressed accordingly.
- Partially throttled hierarchies no longer set the hierarchical
indicators, namely "cfs_rq->throttle_count". All uses of
throttled_hierarchy() need to be audited to see if they need adjusting.
At least two cases, one in throttled_lb_pair() and the other in the
class->task_is_throttled() callback used by core scheduling, require
auditing.
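For illustration, the static key guard mentioned in the opt-out item above
could look roughly like this (a sketch, not part of the series; the
parameter name and helper are made up):

static DEFINE_STATIC_KEY_TRUE(sched_throttle_defer);

static int __init sched_throttle_defer_setup(char *str)
{
        bool enable;

        if (kstrtobool(str, &enable))
                return -EINVAL;

        if (enable)
                static_branch_enable(&sched_throttle_defer);
        else
                static_branch_disable(&sched_throttle_defer);

        return 0;
}
early_param("sched_throttle_defer", sched_throttle_defer_setup);

/* Decision points that can lead to a partial throttle would then check: */
static __always_inline bool throttle_defer_enabled(void)
{
        return static_branch_likely(&sched_throttle_defer);
}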
Clarifications required
=======================
- A kernel thread on a throttled hierarchy will always partially
unthrottle it and make itself runnable. Is this possible? (At least
one case with osnoise was found.) Is this desired?
- A new task always starts off in kernel mode and can unthrottle the
hierarchy to make itself runnable. Is this a desired behavior?
- There are a few "XXX" and "TODO" along that way that I'll be
revisiting. Any comments there is greatly appreciated.
- Patch 21 seems to solve a heap integrity issue that I have
encountered. Late into stress testing I occasionally came across a
scenario where the rbroot does not have the correct min_kcs_vruntime
set. An example dump is as follows:
(Sorry in advance for the size of this dump)
CPU43: ----- current task ----
CPU43: pid(0) comm(swapper/43) task_cpu(43) task_on_rq_queued(1) task_on_rq_migrating(0) normal_policy(1) idle_policy(0)
CPU43: ----- current task done ----
CPU43: ----- cfs_rq ----
CPU43: cfs_rq: throttled?(1) cfs_rq->throttled(1) h_nr_queued(24) h_nr_runnable(24) nr_queued(24) gse->kernel_cs_count(1)
CPU43: cfs_rq EEVDF: avg_vruntime(3043328) avg_load(24576) avg_kcs_vruntime(125952) avg_kcs_load(1024)
CPU43: ----- cfs_rq done ----
CPU43: ----- rbtree traversal: root ----
CPU43: se: load(1024) vruntime(6057209863) entity_key(12716) deadline(6059998121) min_vruntime(6057124320) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057184139) pick_entity(0)
CPU43: ----- Left Subtree ----
CPU43: se: load(1024) vruntime(6057198486) entity_key(1339) deadline(6059991924) min_vruntime(6057124320) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057184139) pick_entity(0)
CPU43: ----- Left Subtree ----
CPU43: se: load(1024) vruntime(6057128111) entity_key(-69036) deadline(6059919846) min_vruntime(6057124320) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057184139) pick_entity(0)
CPU43: ----- Left Subtree ----
CPU43: se: load(1024) vruntime(6057202668) entity_key(5521) deadline(6059894986) min_vruntime(6057124320) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057124320) entity_key(-72827) deadline(6059915644) min_vruntime(6057124320) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Right Subtree Done ----
CPU43: ----- Left Subtree Done ----
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057196261) entity_key(-886) deadline(6059984139) min_vruntime(6057176590) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Left Subtree ----
CPU43: se: load(1024) vruntime(6057176590) entity_key(-20557) deadline(6059967663) min_vruntime(6057176590) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Left Subtree Done ----
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057198406) entity_key(1259) deadline(6059990321) min_vruntime(6057198406) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Right Subtree Done ----
CPU43: ----- Right Subtree Done ----
CPU43: ----- Left Subtree Done ----
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057206114) entity_key(8967) deadline(6059995624) min_vruntime(6057197270) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057197270) pick_entity(0)
CPU43: ----- Left Subtree ----
CPU43: se: load(1024) vruntime(6057202212) entity_key(5065) deadline(6059994879) min_vruntime(6057197431) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Left Subtree ----
CPU43: se: load(1024) vruntime(6057201766) entity_key(4619) deadline(6059993029) min_vruntime(6057197431) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Left Subtree ----
CPU43: se: load(1024) vruntime(6057197431) entity_key(284) deadline(6059992361) min_vruntime(6057197431) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Left Subtree Done ----
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057201845) entity_key(4698) deadline(6059994381) min_vruntime(6057201845) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Right Subtree Done ----
CPU43: ----- Left Subtree Done ----
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057201956) entity_key(4809) deadline(6059995233) min_vruntime(6057201956) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Right Subtree Done ----
CPU43: ----- Left Subtree Done ----
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057203193) entity_key(6046) deadline(6059996320) min_vruntime(6057197270) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057197270) pick_entity(0)
CPU43: ----- Left Subtree ----
CPU43: se: load(1024) vruntime(6057199892) entity_key(2745) deadline(6059995624) min_vruntime(6057199892) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Left Subtree Done ----
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057205914) entity_key(8767) deadline(6059996907) min_vruntime(6057197270) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057197270) pick_entity(0)
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057197270) entity_key(123) deadline(6059997270) min_vruntime(6057197270) on_rq(1)
CPU43: se kcs: kernel_cs_count(1) min_kcs_vruntime(6057197270) pick_entity(1)
CPU43: ----- Right Subtree Done ----
CPU43: ----- Right Subtree Done ----
CPU43: ----- Right Subtree Done ----
CPU43: ----- Right Subtree Done ----
CPU43: ----- Left Subtree Done ----
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057210834) entity_key(13687) deadline(6060002729) min_vruntime(6057209600) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Left Subtree ----
CPU43: se: load(1024) vruntime(6057209600) entity_key(12453) deadline(6060000733) min_vruntime(6057209600) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Left Subtree Done ----
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057215347) entity_key(18200) deadline(6060005989) min_vruntime(6057215103) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Left Subtree ----
CPU43: se: load(1024) vruntime(6057215103) entity_key(17956) deadline(6060004643) min_vruntime(6057215103) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Left Subtree Done ----
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057215276) entity_key(18129) deadline(6060007381) min_vruntime(6057215276) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Right Subtree ----
CPU43: se: load(1024) vruntime(6057216042) entity_key(18895) deadline(6060008258) min_vruntime(6057216042) on_rq(1)
CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
CPU43: ----- Right Subtree Done ----
CPU43: ----- Right Subtree Done ----
CPU43: ----- Right Subtree Done ----
CPU43: ----- Right Subtree Done ----
CPU43: ----- rbtree done ----
CPU43: ----- parent cfs_rq ----
CPU43: ----- cfs_rq ----
CPU43: cfs_rq: throttled?(1) cfs_rq->throttled(0) h_nr_queued(24) h_nr_runnable(24) nr_queued(1) gse->kernel_cs_count(1)
CPU43: cfs_rq EEVDF: avg_vruntime(0) avg_load(312) avg_kcs_vruntime(0) avg_kcs_load(312)
CPU43: ----- cfs_rq done ----
CPU43: ----- parent cfs_rq ----
CPU43: ----- cfs_rq ----
CPU43: cfs_rq: throttled?(1) cfs_rq->throttled(0) h_nr_queued(24) h_nr_runnable(24) nr_queued(1) gse->kernel_cs_count(1)
CPU43: cfs_rq EEVDF: avg_vruntime(0) avg_load(206) avg_kcs_vruntime(0) avg_kcs_load(206)
CPU43: ----- cfs_rq done ----
CPU43: ----- cfs_rq ----
CPU43: cfs_rq: throttled?(0) cfs_rq->throttled(0) h_nr_queued(24) h_nr_runnable(24) nr_queued(1) gse->kernel_cs_count(-1)
CPU43: cfs_rq EEVDF: avg_vruntime(0) avg_load(156) avg_kcs_vruntime(0) avg_kcs_load(156)
CPU43: ----- cfs_rq done ----
The vruntime of the entity corresponding to "pick_entity(1)" should have
been propagated to the root as min_kcs_vruntime, but that does not seem
to happen. This happens unpredictably, and the volume of logs is too
overwhelming to make sense of which series of rbtree operations led up
to it, but grouping all the min_* members and setting them as the
RBAUGMENT field, as implemented in Patch 21, seems to make the issue go
away. If folks think this is a genuine bugfix and not me masking a
bigger issue, I can send this fix separately for min_vruntime and
min_slice propagation and rebase my series on top of it.
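For reference, my guess at what goes wrong: the augment callback reports
"unchanged" based on only a subset of the min_* members, which lets the
propagation stop while min_kcs_vruntime further up the tree is still
stale. A sketch of the grouped callback, based on the existing
min_vruntime_update() in kernel/sched/fair.c (__min_kcs_vruntime_update()
is an assumed analogue of the existing __min_vruntime_update()):

static inline bool min_vruntime_update(struct sched_entity *se, bool exit)
{
        u64 old_min_vruntime = se->min_vruntime;
        u64 old_min_slice = se->min_slice;
        s64 old_min_kcs_vruntime = se->min_kcs_vruntime;
        struct rb_node *node = &se->run_node;

        se->min_vruntime = se->vruntime;
        __min_vruntime_update(se, node->rb_right);
        __min_vruntime_update(se, node->rb_left);

        se->min_slice = se->slice;
        __min_slice_update(se, node->rb_right);
        __min_slice_update(se, node->rb_left);

        se->min_kcs_vruntime = se_in_kernel(se) ? se->vruntime : LLONG_MAX;
        __min_kcs_vruntime_update(se, node->rb_right);
        __min_kcs_vruntime_update(se, node->rb_left);

        /* Only report "unchanged" when every augmented field is stable. */
        return se->min_vruntime == old_min_vruntime &&
               se->min_slice == old_min_slice &&
               se->min_kcs_vruntime == old_min_kcs_vruntime;
}

RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity,
                     run_node, min_vruntime, min_vruntime_update);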
Credits
=======
This series is based on the former approaches from:
Valentin Schneider <vschneid@redhat.com>
Ben Segall <bsegall@google.com>
All the credit for this idea really is due to them - while the mistakes
in implementing it are likely mine :)
Thanks to Swapnil, Dhananjay, Ravi, and Gautham for helping devise the
test strategy, discussions around the intricacies of CPU controllers,
and their help at tackling multiple subtle bugs I encountered along the
way.
References
==========
[1]: https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
[2]: https://youtu.be/YAVxQKQABLw
[3]: https://youtu.be/_N-nXJHiDNo
[4]: http://lore.kernel.org/r/xm26edfxpock.fsf@bsegall-linux.svl.corp.google.com
---
K Prateek Nayak (22):
kernel/entry/common: Move syscall_enter_from_user_mode_work() out of
header
sched/fair: Convert "se->runnable_weight" to unsigned int and pack the
struct
[PoC] kernel/entry/common: Mark syscall as a kernel critical section
[PoC] kernel/sched: Inititalize "kernel_cs_count" for new tasks
sched/fair: Track EEVDF stats for entities preempted in kernel mode
sched/fair: Propagate the min_vruntime of kernel mode preempted entity
sched/fair: Propagate preempted entity information up cgroup hierarchy
sched/fair: Allow pick_eevdf() to pick in-kernel entities on throttled
hierarchy
sched/fair: Introduce cfs_rq throttled states in preparation for
partial throttling
sched/fair: Prepare throttle_cfs_rq() to allow partial throttling
sched/fair: Prepare unthrottle_cfs_rq() to demote throttle status
sched/fair: Prepare bandwidth distribution to unthrottle partial
throttles right away
sched/fair: Correct the throttle status supplied to pick_eevdf()
sched/fair: Mark a task if it was picked on a partially throttled
hierarchy
sched/fair: Call resched_curr() from sched_notify_syscall_exit()
sched/fair: Prepare enqueue to partially unthrottle cfs_rq
sched/fair: Clear pick on throttled indicator when task leave fair
class
sched/fair: Prepare pick_next_task_fair() to unthrottle a throttled
hierarchy
sched/fair: Ignore in-kernel indicators for running task outside of
schedule()
sched/fair: Implement determine_throttle_state() for partial throttle
[MAYBE BUGFIX] sched/fair: Group all the se->min_* members together
for propagation
[DEBUG] sched/fair: Debug pick_eevdf() returning NULL!
include/linux/entry-common.h | 10 +-
include/linux/sched.h | 45 +-
init/init_task.c | 5 +-
kernel/entry/common.c | 17 +
kernel/entry/common.h | 4 +
kernel/sched/core.c | 8 +-
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 935 ++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 8 +-
9 files changed, 950 insertions(+), 84 deletions(-)
base-commit: d34e798094ca7be935b629a42f8b237d4d5b7f1d
--
2.43.0
On Thu, Feb 20, 2025 at 09:32:35AM +0000, K Prateek Nayak wrote:
> Proposed approach
> =================
[..snip..]
> pick_eevdf() uses the min_kcs_vruntime on the virtual deadline sorted
> tree to first check the left subtree for eligibility, then the node
> itself, and then the right subtree.

*groan*... why not actually dequeue the tasks and only retain those with
non-zero cs_count? That avoids having to duplicate everything, no?

Hello Peter,

On 2/20/2025 4:25 PM, Peter Zijlstra wrote:
> On Thu, Feb 20, 2025 at 09:32:35AM +0000, K Prateek Nayak wrote:
[..snip..]
>
> *groan*... why not actually dequeue the tasks and only retain those with
> non-zero cs_count? That avoids having to duplicate everything, no?

The rationale there was that with a growing number of tasks on a cfs_rq,
the throttle path has to perform a lot of dequeues and the unthrottle at
distribution has to enqueue all the dequeued threads back.

This is one way to keep all the tasks queued but allow pick to only
select among those that are preempted in kernel mode.

Since per-task throttling needs to tag, dequeue, and re-enqueue each
task, I'm putting this out as an alternate approach that does not
increase the complexity of the tg_tree walks which Ben had noted on
Valentin's series [1]. Instead we retain the per-cfs_rq throttling at
the cost of some stats tracking at the enqueue and dequeue boundaries.

If you have strong feelings against any specific part, or the entirety
of this approach, please do let me know, and I'll do my best to see if
a tweaked approach or an alternate implementation can scale well with
growing thread counts (or at least try to defend the bits in question
if they still hold merit).

Any and all feedback is appreciated :)

[1] https://lore.kernel.org/lkml/xm26y15yz0q8.fsf@google.com/

--
Thanks and Regards,
Prateek

On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
[..snip..]
> Any and all feedback is appreciated :)

Pfff.. I hate it all :-)

So the dequeue approach puts the pain on the people actually using the
bandwidth crud, while this 'some extra accounting' crap has *everybody*
pay for this nonsense, right?

I'm not sure how bad this extra accounting is, but I do fear death by a
thousand cuts.

On 20/02/25 12:32, Peter Zijlstra wrote:
> On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
[..snip..]
>
> Pfff.. I hate it all :-)
>
> So the dequeue approach puts the pain on the people actually using the
> bandwidth crud, while this 'some extra accounting' crap has *everybody*
> pay for this nonsense, right?
>
> I'm not sure how bad this extra accounting is, but I do fear death by a
> thousand cuts.

FWIW that was my main worry with the dual tree approach and why I gave up
on it in favor of the per-task dequeue faff. Having the overhead mainly
contained in throttle/unthrottle is a lot more attractive than adding
(arguably small) overhead to the enqueue/dequeue paths. There was also the
headache of figuring out what to do with the .*nr_running fields and what
is reflected to load balance, which isn't an issue with the per-task thing.

As pointed by Ben in [1], the issue with the per-task approach is the
scalability of the unthrottle. You have the rq lock held and you
potentially end up unthrottling a deep cgroup hierarchy, putting each
individual task back on its cfs_rq.

I can't find my notes on that in a hurry, but my idea with that for a next
version was to periodically release the rq lock as we go up the cgroup
hierarchy during unthrottle - the idea being that we can mess with part of
hierarchy, and as long as that part isn't connected to the rest (i.e. it's
not enqueued, like we currently do for CFS throttling), "it should be
safe".

FYI I haven't given up on this, it's just that repeatedly context switching
between IPI deferral and this didn't really work for me so I'm sticking to
one 'till it gets somewhere.

[1]: https://lore.kernel.org/lkml/xm26y15yz0q8.fsf@google.com/

On Thu, Feb 20, 2025 at 7:40 AM Valentin Schneider <vschneid@redhat.com> wrote:
...
> As pointed by Ben in [1], the issue with the per-task approach is the
> scalability of the unthrottle. You have the rq lock held and you
> potentially end up unthrottling a deep cgroup hierarchy, putting each
> individual task back on its cfs_rq.
>
> I can't find my notes on that in a hurry, but my idea with that for a next
> version was to periodically release the rq lock as we go up the cgroup
> hierarchy during unthrottle - the idea being that we can mess with part of
> hierarchy, and as long as that part isn't connected to the rest (i.e. it's
> not enqueued, like we currently do for CFS throttling), "it should be
> safe".

Can you elaborate a bit more? Even if we periodically release the
lock, we're still spending quite a long time in non-preemptible kernel
context, and unthrottle is also driven by an hrtimer. So we can still
do quite a lot of damage depending on how long the whole loop takes.

On 20/02/25 17:47, Josh Don wrote:
> On Thu, Feb 20, 2025 at 7:40 AM Valentin Schneider <vschneid@redhat.com> wrote:
[..snip..]
>
> Can you elaborate a bit more? Even if we periodically release the
> lock, we're still spending quite a long time in non-preemptible kernel
> context, and unthrottle is also driven by an hrtimer. So we can still
> do quite a lot of damage depending on how long the whole loop takes.

Indeed, this only gives the rq lock a breather, but it doesn't help with
preempt / irq off.

Hello Valentin,

On 2/20/2025 9:10 PM, Valentin Schneider wrote:
> On 20/02/25 12:32, Peter Zijlstra wrote:
[..snip..]
>
> FWIW that was my main worry with the dual tree approach and why I gave up
> on it in favor of the per-task dequeue faff. Having the overhead mainly
> contained in throttle/unthrottle is a lot more attractive than adding
> (arguably small) overhead to the enqueue/dequeue paths. There was also the
> headache of figuring out what to do with the .*nr_running fields and what
> is reflected to load balance, which isn't an issue with the per-task thing.

I believed that with the differentiation of nr_queued and nr_runnable now,
the counts would be simpler to correct (I might be wrong). This approach
retains the single rbtree, but yes, there is a cost associated with
maintaining these stats. The stats collection can be deferred until a
bandwidth constraint is first enforced, but the small cost then remains in
every enqueue, dequeue, put_prev_entity, and set_next_entity path
thereafter. Arguably, this should be no costlier than the current tracking
of h_nr_delayed + min_slice in the enqueue and dequeue paths, but I might
be wrong.

> As pointed by Ben in [1], the issue with the per-task approach is the
> scalability of the unthrottle. You have the rq lock held and you
> potentially end up unthrottling a deep cgroup hierarchy, putting each
> individual task back on its cfs_rq.

Agreed, which is why this alternate approach that retains the throttling
and unthrottling at the cfs_rq level was worth a try.

> I can't find my notes on that in a hurry, but my idea with that for a next
> version was to periodically release the rq lock as we go up the cgroup
> hierarchy during unthrottle - the idea being that we can mess with part of
> hierarchy, and as long as that part isn't connected to the rest (i.e. it's
> not enqueued, like we currently do for CFS throttling), "it should be
> safe".

That is pretty nifty!

My only concern there would be the case where a part of the throttled
hierarchy is still reachable on unthrottle but a part has dequeued itself -
some tasks might have to wait until that part is queued again to be
accessible during the pick, and a bunch of rescheds follow with each batch
of enqueues.

> FYI I haven't given up on this, it's just that repeatedly context switching
> between IPI deferral and this didn't really work for me so I'm sticking to
> one 'till it gets somewhere.

Ack! This RFC was to get feedback from folks and to see if there are any
takers for cfs_rq level throttling and the reasons to move to per-task
throttling. Safe to say I'm slowly getting some answers :)

> [1]: https://lore.kernel.org/lkml/xm26y15yz0q8.fsf@google.com/

--
Thanks and Regards,
Prateek

Hello Peter,

On 2/20/2025 5:02 PM, Peter Zijlstra wrote:
> On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
[..snip..]
>
> Pfff.. I hate it all :-)
>
> So the dequeue approach puts the pain on the people actually using the
> bandwidth crud,

Back in Josh Don's presentation at the "Humongous Servers vs Kernel
Scalability" BoF [1] at LPC'24, they mentioned one server handles around
"O(250k) threads" (Slide 21). Assuming 256 logical CPUs from their first
couple of slides, that is about 1K potential tasks that can be throttled
in one go on each CPU. Doing that within a single rq_lock critical section
may take quite a bit of time.

Is the expectation that these deployments have to be managed more
smartly if we move to a per-task throttling model? Else it is just a
hard lockup by a thousand tasks.

If Ben or Josh can comment on any scalability issues they might have
seen on their deployments and any learnings they have drawn from them
since LPC'24, it would be great. Any stats on the number of tasks that
get throttled in one go would also be helpful.

[1] https://lpc.events/event/18/contributions/1855/attachments/1436/3432/LPC%202024_%20Scalability%20BoF.pdf

> while this 'some extra accounting' crap has *everybody*
> pay for this nonsense, right?

That is correct. Let me go and get some numbers to see if the overhead is
visible, but with deeper hierarchies there is a ton that goes on already
which may hide these overheads. I'll try with different hierarchy levels
on a wakeup-heavy task.

> I'm not sure how bad this extra accounting is, but I do fear death by a
> thousand cuts.

We surely don't want that!

--
Thanks and Regards,
Prateek

On Thu, Feb 20, 2025 at 4:04 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Peter,
>
> On 2/20/2025 5:02 PM, Peter Zijlstra wrote:
> > On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
> >> Any and all feedback is appreciated :)
> >
> > Pfff.. I hate it all :-)
> >
> > So the dequeue approach puts the pain on the people actually using the
> > bandwidth crud, while this 'some extra accounting' crap has *everybody*
> > pay for this nonsense, right?

Doing the context tracking could also provide benefit beyond CFS
bandwidth. As an example, we often see a pattern where a thread
acquires one mutex, then sleeps on trying to take a second mutex. When
the thread eventually is woken due to the second mutex now being
available, the thread now needs to wait to get back on cpu, which can
take an arbitrary amount of time depending on where it landed in the
tree, its weight, etc. Other threads trying to acquire that first
mutex now experience priority inversion as they must wait for the
original thread to get back on cpu and release the mutex. Re-using the
same context tracking, we could prioritize execution of threads in
kernel critical sections, even if they aren't the fair next choice.

If that isn't convincing enough, we could certainly throw another
kconfig or boot param for this behavior :)

> Is the expectation that these deployments have to be managed more
> smartly if we move to a per-task throttling model? Else it is just a
> hard lockup by a thousand tasks.

+1, I don't see the per-task throttling being able to scale here.

> If Ben or Josh can comment on any scalability issues they might have
> seen on their deployments and any learnings they have drawn from them
> since LPC'24, it would be great. Any stats on the number of tasks that
> get throttled in one go would also be helpful.

Maybe just to emphasize that we continue to see the same type of
slowness; throttle/unthrottle when traversing a large cgroup
sub-hierarchy is still an issue for us and we're working on sending a
patch to ideally break this up to do the updates more lazily, as
described at LPC.

In particular, throttle/unthrottle (whether it be on a group basis or
a per-task basis) is a loop that is subject to a lot of cache misses.

Best,
Josh

Hello Josh,

Thank you for sharing the background!

On 2/21/2025 7:34 AM, Josh Don wrote:
> On Thu, Feb 20, 2025 at 4:04 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
[..snip..]
>
> Doing the context tracking could also provide benefit beyond CFS
> bandwidth. As an example, we often see a pattern where a thread
> acquires one mutex, then sleeps on trying to take a second mutex. When
> the thread eventually is woken due to the second mutex now being
> available, the thread now needs to wait to get back on cpu, which can
> take an arbitrary amount of time depending on where it landed in the
> tree, its weight, etc. Other threads trying to acquire that first
> mutex now experience priority inversion as they must wait for the
> original thread to get back on cpu and release the mutex. Re-using the
> same context tracking, we could prioritize execution of threads in
> kernel critical sections, even if they aren't the fair next choice.

Just out of curiosity, have you tried running with proxy-execution [1][2]
on your deployments to mitigate priority inversion in mutexes? I've
tested it with smaller-scale benchmarks and I haven't seen much overhead
except in the case of a few microbenchmarks, but I'm not sure if you've
run into any issues at your scale.

[1] https://lore.kernel.org/lkml/20241125195204.2374458-1-jstultz@google.com/
[2] https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v14-6.13-rc1/

[..snip..]

> Maybe just to emphasize that we continue to see the same type of
> slowness; throttle/unthrottle when traversing a large cgroup
> sub-hierarchy is still an issue for us and we're working on sending a
> patch to ideally break this up to do the updates more lazily, as
> described at LPC.

Is it possible to share an example hierarchy from one of your
deployments? Your presentation at LPC'24 [3] says "O(1000) cgroups", but
is it possible to reveal the kind of nesting you deal with and at which
levels the bandwidth controls are set? Even something like "O(10) cgroups
on root with BW throttling set, and each of them contains O(100) cgroups
below" could also help match a test setup.

[3] https://lpc.events/event/18/contributions/1855/attachments/1436/3432/LPC%202024_%20Scalability%20BoF.pdf

--
Thanks and Regards,
Prateek

On Thu, Feb 20, 2025 at 7:38 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
...
> Just out of curiosity, have you tried running with proxy-execution [1][2]
> on your deployments to mitigate priority inversion in mutexes? I've
> tested it with smaller-scale benchmarks and I haven't seen much overhead
> except in the case of a few microbenchmarks, but I'm not sure if you've
> run into any issues at your scale.

The confounding issue is that we see tail issues with other types of
primitives, such as semaphores. That led us to trying an approach similar
to yours with treating kernel-mode as a critical section from the
perspective of e.g. CFSB.

> Is it possible to share an example hierarchy from one of your
> deployments? Your presentation at LPC'24 [3] says "O(1000) cgroups", but
> is it possible to reveal the kind of nesting you deal with and at which
> levels the bandwidth controls are set? Even something like "O(10) cgroups
> on root with BW throttling set, and each of them contains O(100) cgroups
> below" could also help match a test setup.

Sure, I can help shed some additional light. In terms of cgroup depth, we
try to keep that fairly limited, given the cgroup depth scaling issues
with task enqueue/dequeue. Max depth is maybe around ~5 depending on the
exact job configuration, with an average closer to 2-3. However, width is
quite large as we have many large dual socket machines that can handle
hundreds of individual jobs (as I called out in the presentation, larger
cpu count leads to more cgroups on the machine in order to fully utilize
resources). The example I referred to in the presentation looks something
like:

root -> subtree_parent (this cgroup has CFSB enabled, period = 100ms)
     -> (~300-400 direct children, with some fraction having additional
        child cgroups, bringing total to O(1000))

Best,
Josh