[RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode

Posted by K Prateek Nayak 10 months ago
Hello everyone,

tl;dr

This is a spiritual successor to Valentin's series to defer CFS throttle
to user entry [1]; however, this one does things differently: instead of
moving to per-task throttling, it retains throttling at the cfs_rq level
while tracking the tasks queued on the cfs_rq that were preempted in
kernel mode.

On a throttled hierarchy, only the kernel mode preempted entities are
considered runnable, and pick_task_fair() will only select among these
entities in EEVDF fashion using a min-heap approach (Patches 6-8).

This is an early prototype with a few issues (please see the "Broken
bits and disclaimer" section); however, I'm putting it out to get some
early feedback.

This series is based on

    git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core

at commit d34e798094ca ("sched/fair: Refactor can_migrate_task() to
elimate looping")

With that quick tl;dr sorted, let us jump right into it.

Problem statement
=================

Following are Valentin's words taken from [1], with a few changes to the
reference commits now that PREEMPT_RT has landed upstream:

CFS tasks can end up throttled while holding locks that other,
non-throttled tasks are blocking on.

For !PREEMPT_RT, this can be a source of latency due to the throttling
causing a resource acquisition denial.

For PREEMPT_RT, this is worse and can lead to a deadlock:
o A CFS task p0 gets throttled while holding read_lock(&lock)
o A task p1 blocks on write_lock(&lock), making further readers enter
  the slowpath
o A ktimers or ksoftirqd task blocks on read_lock(&lock)

If the cfs_bandwidth.period_timer to replenish p0's runtime is enqueued
on the same CPU as one where ktimers/ksoftirqd is blocked on
read_lock(&lock), this creates a circular dependency.

This has been observed to happen with:
o fs/eventpoll.c::ep->lock
o net/netlink/af_netlink.c::nl_table_lock (after hand-fixing the above)
but can trigger with any rwlock that can be acquired in both process and
softirq contexts.

For PREEMPT_RT, the upstream kernel added commit 49a17639508c ("softirq:
Use a dedicated thread for timer wakeups on PREEMPT_RT.") which helped
this scenario for non-rwlock locks by ensuring the throttled task would
get PI'd to FIFO1 (ktimers' default priority). Unfortunately, rwlocks
cannot sanely do PI as they allow multiple readers.

Throttle deferral was discussed at OSPM'24 [2] and at LPC'24 [3], with
recordings of both available on YouTube.

Proposed approach
=================

This approach builds on Ben Segall's proposal in [4], which marked the
task in schedule() when exiting to usermode by setting an
"in_return_to_user" flag, except this prototype takes it a step further
and marks a "kernel critical section" within the syscall boundary using
a per-task "kernel_cs_count".

The rationale behind this approach is that a task can only hold kernel
resources when running in kernel mode in preemptible context. In this
POC, the entire syscall boundary is marked as a kernel critical
section, but in the future the API can be used to mark fine-grained
boundaries, such as the region between a down_read() and its matching
up_read(), or between a down_write() and up_write().
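
For illustration, below is a minimal sketch of what marking the syscall
boundary could look like. The hook names follow the
sched_notify_critical_section*() naming used elsewhere in this cover
letter; where the count lives and how the exit path reacts are
assumptions made here for illustration, not the exact code in the
series:

    /*
     * Minimal sketch only: treat the syscall boundary as a kernel
     * critical section by bumping a per-task count from the
     * GENERIC_ENTRY syscall enter/exit paths.
     */
    static inline void sched_notify_critical_section_enter(void)
    {
            /* From here on the task may acquire kernel resources. */
            current->se.kernel_cs_count++;
    }

    static inline void sched_notify_critical_section_exit(void)
    {
            current->se.kernel_cs_count--;

            /* A signed count catches unbalanced enter/exit pairs. */
            WARN_ON_ONCE(current->se.kernel_cs_count < 0);

            /*
             * If the task was picked on a partially throttled
             * hierarchy, this is the point where a resched would be
             * requested so the deferred (full) throttle can finally
             * be applied.
             */
    }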

Within a kernel critical section, throttling events are deferred until
the task's "kernel_cs_count" hits 0. Currently this count is a signed
integer so that any case where it turns negative (as a result of
oversights on my part) can be caught, but it could be changed to a
preempt-count-like mechanism to request a resched.

            cfs_rq throttled               picked again
		  v                              v

    ----|*********| (preempted by tick / wakeup) |***********| (full throttle)

        ^                                                    ^
critical section   cfs_rq is throttled partially     critical section
      entry           since the task is still              exit
                  runnable as it was preempted in
                      kernel critical section

The EEVDF infrastructure is extended to track the avg_vruntime and the
avg_load of only those entities preempted in kernel mode. When a cfs_rq
is throttled, it uses these metrics to select among the kernel mode
preempted tasks and run them until they exit to user mode. pick_eevdf()
is made aware that it is operating on a throttled hierarchy so that it
only selects among the tasks that were preempted in kernel mode (and
the sched entities of the cfs_rqs that lead to them). When a throttled
entity's "kernel_cs_count" hits 0, the entire hierarchy is frozen, but
until that point the hierarchy remains accessible for picking.

          root
        /     \
       A       B * (throttled)
      ...    / | \
            0  1* 2*

    (*) Preempted in kernel mode

  o avg_kcs_vruntime = entity_key(1) * load(1) + entity_key(2) * load(2)
  o avg_kcs_load = load(1) + load(2)

  o throttled_vruntime_eligible:

      entity preempted in kernel mode &&
      entity_key(<>) * avg_kcs_load <= avg_kcs_vruntime

  o The rbtree is augmented with a "min_kcs_vruntime" field in the sched
    entity that propagates the smallest vruntime of all the entities in
    the subtree that are preempted in kernel mode. If an entity was
    executing in usermode when preempted, its contribution is LLONG_MAX.

    This is used to construct a min-heap and search through the
    entities. Consider B's rbtree to look like:

         1*
       /   \
      2*    0

      min_kcs_vruntime = se_in_kernel(se) ? se->vruntime : LLONG_MAX;
      min_kcs_vruntime = min(se->min_kcs_vruntime,
                             __node_2_se(rb_left)->min_kcs_vruntime,
                             __node_2_se(rb_right)->min_kcs_vruntime);

   pick_eevdf() uses min_kcs_vruntime on the virtual-deadline-sorted
   tree to first check the left subtree for an eligible entity, then
   the node itself, and then the right subtree (a rough C sketch of the
   eligibility check and this propagation follows this list).
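
To make the above concrete, here is a rough C sketch of the two pieces
described in this list: the eligibility test restricted to kernel mode
preempted entities, and the min_kcs_vruntime propagation. It is modeled
on the existing vruntime_eligible() and min_vruntime augmentation in
fair.c; the exact field types and helper signatures in the series may
differ:

    /* Eligibility restricted to the kernel mode preempted subset. */
    static bool throttled_vruntime_eligible(struct cfs_rq *cfs_rq,
                                            struct sched_entity *se)
    {
            if (!se_in_kernel(se))
                    return false;

            /* Same weighted-average test as vruntime_eligible(). */
            return entity_key(cfs_rq, se) * (s64)cfs_rq->avg_kcs_load <=
                   cfs_rq->avg_kcs_vruntime;
    }

    /* RBCOMPUTE-style callback maintaining the min-heap property. */
    static inline bool min_kcs_vruntime_update(struct sched_entity *se, bool exit)
    {
            u64 old_min_kcs_vruntime = se->min_kcs_vruntime;
            struct rb_node *node = &se->run_node;

            /* Entities preempted in user mode never win this min-heap. */
            se->min_kcs_vruntime = se_in_kernel(se) ? se->vruntime : LLONG_MAX;

            if (node->rb_left)
                    se->min_kcs_vruntime = min(se->min_kcs_vruntime,
                                    __node_2_se(node->rb_left)->min_kcs_vruntime);
            if (node->rb_right)
                    se->min_kcs_vruntime = min(se->min_kcs_vruntime,
                                    __node_2_se(node->rb_right)->min_kcs_vruntime);

            /* Returning true lets rb_augment stop propagating upwards. */
            return se->min_kcs_vruntime == old_min_kcs_vruntime;
    }

pick_eevdf() can then descend the virtual deadline tree, preferring
subtrees whose min_kcs_vruntime is eligible, analogous to how the
regular pick uses the min_vruntime augmentation.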

With proactive tracking, throttle deferral can retain throttling at per
cfs_rq granularity instead of moving to a per-task model.

On the throttling side, a third throttle state, CFS_THROTTLE_PARTIAL,
is introduced. It indicates that the cfs_rq has run out of its
bandwidth but still contains runnable (kernel mode preempted) entities.
A partially throttled cfs_rq is added to the throttled list so it can
be unthrottled in a timely manner during distribution, making all the
tasks queued on it runnable again. The throttle status can be promoted
or demoted depending on whether a kernel mode preempted task exits to
user mode, blocks, or wakes up on the hierarchy.
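
As a rough illustration, the state machine could be expressed as below.
The numeric values are inferred from the (0)/(1)/(2) transitions in the
traces of the Testing section; the enumerator names and the exact
condition used to pick partial over full throttle are assumptions:

    enum cfs_throttle_state {
            CFS_UNTHROTTLED      = 0,
            CFS_THROTTLE_PARTIAL = 1, /* out of bandwidth, kernel mode preempted entities still runnable */
            CFS_THROTTLED        = 2, /* out of bandwidth, nothing runnable remains */
    };

    /*
     * Sketch: a cfs_rq that still accounts load for kernel mode
     * preempted entities is only partially throttled.
     */
    static inline int determine_throttle_state(struct cfs_rq *cfs_rq)
    {
            return cfs_rq->avg_kcs_load ? CFS_THROTTLE_PARTIAL : CFS_THROTTLED;
    }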

The trickiest part of the series is the handling of the current task,
which is kept out of the kernel mode status tracking since a running
task can make multiple syscalls and propagating those signals up the
cgroup hierarchy can be expensive. Instead, the status of the current
task is only looked at during schedule() to make throttling decisions.
This leads to some interesting considerations in a few paths (see
curr_h_is_throttled() in Patch 8 for the EEVDF handling).

Testing
=======

The series was tested for correctness using a single hackbench instance
in a bandwidth controlled cgroup to observe the following set of events:

  o Throttle into partial state followed by a full throttle

   hackbench-1602    [044] dN.2.    55.684108: throttle_cfs_rq: cfs_rq throttled: (0) -> (1)
   hackbench-1602    [044] dN.2.    55.684111: pick_task_fair: Task picked on throttled hierarchy: hackbench(1602)
   hackbench-1602    [044] d....    55.684117: sched_notify_critical_section_exit: Pick on throttled requesting resched
   hackbench-1602    [044] dN.2.    55.684119: throttle_cfs_rq: cfs_rq throttled: (1) -> (2)
   ...
      <idle>-0       [044] dNh2.    55.689677: unthrottle_cfs_rq: cfs_rq unthrottled: (2) -> (0)

  o Full throttle demoted to partial and then promoted back to full

      <idle>-0       [006] dN.2.    55.592145: unthrottle_cfs_rq: cfs_rq unthrottled: (2) -> (1)
      <idle>-0       [006] dN.2.    55.592147: pick_task_fair: Task picked on throttled hierarchy: hackbench(1584)
   ...
   hackbench-1584    [006] d....    55.592154: sched_notify_critical_section_exit: Pick on throttled requesting resched
   ...
   hackbench-1584    [006] dN.2.    55.592157: throttle_cfs_rq: cfs_rq throttled: (1) -> (2)
 
  [ Note: (1) corresponds to CFS_THROTTLE_PARTIAL (throttled but with
    kernel mode preempted entities that are still runnable) and (2)
    corresponds to CFS_THROTTLED (full throttle with no runnable task) ]

The stability testing was done by running the following script:

    #!/bin/bash
    
    mkdir /sys/fs/cgroup/CG0
    echo "+cpu" > /sys/fs/cgroup/CG0/cgroup.subtree_control
    echo "+cpuset" > /sys/fs/cgroup/CG0/cgroup.subtree_control
    mkdir /sys/fs/cgroup/CG0/CG00
    mkdir /sys/fs/cgroup/CG0/CG01
    echo "+cpu" > /sys/fs/cgroup/CG0/CG00/cgroup.subtree_control
    echo "+cpu" > /sys/fs/cgroup/CG0/CG01/cgroup.subtree_control
    echo "+cpuset" > /sys/fs/cgroup/CG0/CG01/cgroup.subtree_control
    echo "+cpuset" > /sys/fs/cgroup/CG0/CG00/cgroup.subtree_control
    mkdir /sys/fs/cgroup/CG0/CG00/CG000
    mkdir /sys/fs/cgroup/CG0/CG00/CG001
    mkdir /sys/fs/cgroup/CG0/CG01/CG000
    mkdir /sys/fs/cgroup/CG0/CG01/CG001
    echo "+cpu" > /sys/fs/cgroup/CG0/CG00/CG001/cgroup.subtree_control
    echo "+cpu" > /sys/fs/cgroup/CG0/CG00/CG000/cgroup.subtree_control
    echo "+cpu" > /sys/fs/cgroup/CG0/CG01/CG000/cgroup.subtree_control
    echo "+cpu" > /sys/fs/cgroup/CG0/CG01/CG001/cgroup.subtree_control
    echo "+cpuset" > /sys/fs/cgroup/CG0/CG01/CG000/cgroup.subtree_control
    echo "+cpuset" > /sys/fs/cgroup/CG0/CG01/CG001/cgroup.subtree_control
    echo "+cpuset" > /sys/fs/cgroup/CG0/CG00/CG000/cgroup.subtree_control
    echo "+cpuset" > /sys/fs/cgroup/CG0/CG00/CG001/cgroup.subtree_control
    echo "3200000 100000" > /sys/fs/cgroup/CG0/cpu.max
    echo "1600000 100000" > /sys/fs/cgroup/CG0/CG00/cpu.max
    echo "1600000 100000" > /sys/fs/cgroup/CG0/CG01/cpu.max
    echo "800000 100000" > /sys/fs/cgroup/CG0/CG00/CG000/cpu.max
    echo "800000 100000" > /sys/fs/cgroup/CG0/CG00/CG001/cpu.max
    echo "800000 100000" > /sys/fs/cgroup/CG0/CG01/CG000/cpu.max
    echo "800000 100000" > /sys/fs/cgroup/CG0/CG01/CG001/cpu.max
    
    sudo su -c "bash -c \"echo \$\$ > /sys/fs/cgroup/CG0/CG00/CG000/cgroup.procs; hackbench -p -l 5000000 -g $1 -T& wait;\""&
    sudo su -c "bash -c \"echo \$\$ > /sys/fs/cgroup/CG0/CG00/CG001/cgroup.procs; hackbench -p -l 5000000 -g $1 -T& wait;\""&
    sudo su -c "bash -c \"echo \$\$ > /sys/fs/cgroup/CG0/CG01/CG000/cgroup.procs; hackbench -p -l 5000000 -g $1 -T& wait;\""&
    sudo su -c "bash -c \"echo \$\$ > /sys/fs/cgroup/CG0/CG01/CG001/cgroup.procs; hackbench -p -l 5000000 -g $1 -T& wait;\""&
    
    time wait;

---

The above script was stress tested with 1, 2, 4, and 8 groups in a
64 vCPU VM, and with up to 64 groups on a 3rd Generation EPYC system
(2 x 64C/128T) without experiencing any obvious issues.

Since I initially ran into various issues with this prototype, I've
retained a debug patch towards the end that dumps the rbtree state when
pick_eevdf() fails to select a task. Since it is a debug patch, all the
checkpatch warnings and errors have been ignored. I apologize in
advance for any inconvenience there.

Broken bits and disclaimer
==========================

In ascending order of priority:

- A few SCHED_WARN_ON()s that were added assume that only the syscall
  boundary is a critical section. These can go away if this
  infrastructure is used for more fine-grained deferral.

- !CONFIG_CFS_BANDWIDTH and !CONFIG_SCHED_AUTOGROUP have only been
  build tested. Most (almost all) of the testing was done with
  CONFIG_CFS_BANDWIDTH enabled on an x86 system and a VM.

- The current PoC only supports architectures that select
  GENERIC_ENTRY. Other architectures need to use the
  sched_notify_critical_section*() API to mark the kernel critical
  section boundaries.

- Some abstractions are not very clean. For example, the generic layer
  often uses se_in_kernel() passing &p->se as the argument. Instead,
  the task_struct could be passed directly to a wrapper around
  se_in_kernel().

- There is currently no way to opt out of throttle deferral. However,
  it is possible to introduce a sysctl hook to enable / disable the
  feature on demand by only allowing propagation of "kernel_cs_count"
  down the cgroup hierarchy when the feature is enabled, plus a static
  key to guard the key decision points that can lead to a partial
  throttle. A simpler option is a boot-time switch.

- On a partially throttled hierarchy, only kernel mode preempted tasks
  and the entities that lead to them can make progress. This can
  theoretically lead them to accumulate an unbounded amount of negative
  lag, which can lead to long wait times for these entities once the
  hierarchy is unthrottled.

- PELT needs a closer look since the number of runnable entities on a
  throttled hierarchy is no longer the same as "h_nr_runnable". All
  references to "cfs_rq->h_nr_runnable" and "se->runnable_sum" need to
  be audited and the affected paths addressed accordingly.

- Partially throttled hierarchies no longer set the hierarchical
  indicators, namely "cfs_rq->throttle_count". All uses of
  throttled_hierarchy() need to be audited to see if they need
  adjusting. At least two cases require auditing: one in
  throttled_lb_pair(), and the other in the class->task_is_throttled()
  callback used by core scheduling.

Clarifications required
=======================

- A kernel thread on a throttled hierarchy will always partially
  unthrottle it and make itself runnable. Is this possible? (At least
  one case with osnoise was found.) Is this desired?

- A new task always starts off in kernel mode and can unthrottle the
  hierarchy to make itself runnable. Is this the desired behavior?

- There are a few "XXX" and "TODO" markers along the way that I'll be
  revisiting. Any comments there are greatly appreciated.

- Patch 21 seems to solve a heap integrity issue that I have
  encountered. Late into stress testing, I occasionally came across a
  scenario where the rbtree root does not have the correct
  min_kcs_vruntime set. An example dump is as follows:

    (Sorry in advance for the size of this dump)

    CPU43: ----- current task ----
    CPU43: pid(0) comm(swapper/43) task_cpu(43) task_on_rq_queued(1) task_on_rq_migrating(0) normal_policy(1) idle_policy(0)
    CPU43: ----- current task done ----
    CPU43: ----- cfs_rq ----
    CPU43: cfs_rq: throttled?(1) cfs_rq->throttled(1) h_nr_queued(24) h_nr_runnable(24) nr_queued(24) gse->kernel_cs_count(1)
    CPU43: cfs_rq EEVDF: avg_vruntime(3043328) avg_load(24576) avg_kcs_vruntime(125952) avg_kcs_load(1024)
    CPU43: ----- cfs_rq done ----
    CPU43: ----- rbtree traversal: root ----
    CPU43: se: load(1024) vruntime(6057209863) entity_key(12716) deadline(6059998121) min_vruntime(6057124320) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057184139) pick_entity(0)
    CPU43: ----- Left Subtree ----
    CPU43: se: load(1024) vruntime(6057198486) entity_key(1339) deadline(6059991924) min_vruntime(6057124320) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057184139) pick_entity(0)
    CPU43: ----- Left Subtree ----
    CPU43: se: load(1024) vruntime(6057128111) entity_key(-69036) deadline(6059919846) min_vruntime(6057124320) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057184139) pick_entity(0)
    CPU43: ----- Left Subtree ----
    CPU43: se: load(1024) vruntime(6057202668) entity_key(5521) deadline(6059894986) min_vruntime(6057124320) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057124320) entity_key(-72827) deadline(6059915644) min_vruntime(6057124320) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Left Subtree Done ----
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057196261) entity_key(-886) deadline(6059984139) min_vruntime(6057176590) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Left Subtree ----
    CPU43: se: load(1024) vruntime(6057176590) entity_key(-20557) deadline(6059967663) min_vruntime(6057176590) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Left Subtree Done ----
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057198406) entity_key(1259) deadline(6059990321) min_vruntime(6057198406) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Left Subtree Done ----
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057206114) entity_key(8967) deadline(6059995624) min_vruntime(6057197270) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057197270) pick_entity(0)
    CPU43: ----- Left Subtree ----
    CPU43: se: load(1024) vruntime(6057202212) entity_key(5065) deadline(6059994879) min_vruntime(6057197431) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Left Subtree ----
    CPU43: se: load(1024) vruntime(6057201766) entity_key(4619) deadline(6059993029) min_vruntime(6057197431) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Left Subtree ----
    CPU43: se: load(1024) vruntime(6057197431) entity_key(284) deadline(6059992361) min_vruntime(6057197431) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Left Subtree Done ----
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057201845) entity_key(4698) deadline(6059994381) min_vruntime(6057201845) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Left Subtree Done ----
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057201956) entity_key(4809) deadline(6059995233) min_vruntime(6057201956) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Left Subtree Done ----
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057203193) entity_key(6046) deadline(6059996320) min_vruntime(6057197270) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057197270) pick_entity(0)
    CPU43: ----- Left Subtree ----
    CPU43: se: load(1024) vruntime(6057199892) entity_key(2745) deadline(6059995624) min_vruntime(6057199892) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Left Subtree Done ----
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057205914) entity_key(8767) deadline(6059996907) min_vruntime(6057197270) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(6057197270) pick_entity(0)
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057197270) entity_key(123) deadline(6059997270) min_vruntime(6057197270) on_rq(1)
    CPU43: se kcs: kernel_cs_count(1) min_kcs_vruntime(6057197270) pick_entity(1)
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Left Subtree Done ----
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057210834) entity_key(13687) deadline(6060002729) min_vruntime(6057209600) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Left Subtree ----
    CPU43: se: load(1024) vruntime(6057209600) entity_key(12453) deadline(6060000733) min_vruntime(6057209600) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Left Subtree Done ----
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057215347) entity_key(18200) deadline(6060005989) min_vruntime(6057215103) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Left Subtree ----
    CPU43: se: load(1024) vruntime(6057215103) entity_key(17956) deadline(6060004643) min_vruntime(6057215103) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Left Subtree Done ----
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057215276) entity_key(18129) deadline(6060007381) min_vruntime(6057215276) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Right Subtree ----
    CPU43: se: load(1024) vruntime(6057216042) entity_key(18895) deadline(6060008258) min_vruntime(6057216042) on_rq(1)
    CPU43: se kcs: kernel_cs_count(0) min_kcs_vruntime(9223372036854775807) pick_entity(0)
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- Right Subtree Done ----
    CPU43: ----- rbtree done ----
    CPU43: ----- parent cfs_rq ----
    CPU43: ----- cfs_rq ----
    CPU43: cfs_rq: throttled?(1) cfs_rq->throttled(0) h_nr_queued(24) h_nr_runnable(24) nr_queued(1) gse->kernel_cs_count(1)
    CPU43: cfs_rq EEVDF: avg_vruntime(0) avg_load(312) avg_kcs_vruntime(0) avg_kcs_load(312)
    CPU43: ----- cfs_rq done ----
    CPU43: ----- parent cfs_rq ----
    CPU43: ----- cfs_rq ----
    CPU43: cfs_rq: throttled?(1) cfs_rq->throttled(0) h_nr_queued(24) h_nr_runnable(24) nr_queued(1) gse->kernel_cs_count(1)
    CPU43: cfs_rq EEVDF: avg_vruntime(0) avg_load(206) avg_kcs_vruntime(0) avg_kcs_load(206)
    CPU43: ----- cfs_rq done ----
    CPU43: ----- cfs_rq ----
    CPU43: cfs_rq: throttled?(0) cfs_rq->throttled(0) h_nr_queued(24) h_nr_runnable(24) nr_queued(1) gse->kernel_cs_count(-1)
    CPU43: cfs_rq EEVDF: avg_vruntime(0) avg_load(156) avg_kcs_vruntime(0) avg_kcs_load(156)
    CPU43: ----- cfs_rq done ----

  The vruntime of the entity marked "pick_entity(1)" should have been
  propagated to the root as min_kcs_vruntime, but that does not seem to
  happen. This happens unpredictably, and the amount of logging is too
  overwhelming to work out which series of rbtree operations leads up
  to it, but grouping all the min_* members and setting that group as
  the RBAUGMENTED field, as implemented in Patch 21, seems to make the
  issue go away.

  If folks think this is a genuine bugfix and not me masking a bigger
  issue, I can send this fix separately for min_vruntime and min_slice
  propagation and rebase my series on top of it.
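
  For reference, the Patch 21 change roughly boils down to the sketch
  below: keep the augmented min_* members adjacent (grouped) in the
  sched entity, hand that group to RB_DECLARE_CALLBACKS() as the
  RBAUGMENTED field so the generated rotate/copy callbacks treat all of
  them as one unit, and refresh them from a single RBCOMPUTE callback.
  The "min_stats" grouping and the per-field refresh helpers are
  illustrative placeholders, not the exact names from the patch:

    static inline bool min_stats_update(struct sched_entity *se, bool exit)
    {
            struct sched_entity_min_stats old = se->min_stats;

            min_vruntime_refresh(se);     /* existing min_vruntime/min_slice logic */
            min_kcs_vruntime_refresh(se); /* propagation described earlier */

            /* Stop walking up only once *all* min_* values are stable. */
            return !memcmp(&old, &se->min_stats, sizeof(old));
    }

    RB_DECLARE_CALLBACKS(static, min_stats_cb, struct sched_entity,
                         run_node, min_stats, min_stats_update);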

Credits
=======

This series is based on the former approaches from:

Valentin Schneider <vschneid@redhat.com>
Ben Segall <bsegall@google.com>

All the credit for this idea really is due to them - while the mistakes
in implementing it are likely mine :)

Thanks to Swapnil, Dhananjay, Ravi, and Gautham for helping devise the
test strategy, for discussions around the intricacies of CPU
controllers, and for their help in tackling multiple subtle bugs I
encountered along the way.

References
==========

[1]: https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
[2]: https://youtu.be/YAVxQKQABLw
[3]: https://youtu.be/_N-nXJHiDNo
[4]: http://lore.kernel.org/r/xm26edfxpock.fsf@bsegall-linux.svl.corp.google.com

---
K Prateek Nayak (22):
  kernel/entry/common: Move syscall_enter_from_user_mode_work() out of
    header
  sched/fair: Convert "se->runnable_weight" to unsigned int and pack the
    struct
  [PoC] kernel/entry/common: Mark syscall as a kernel critical section
  [PoC] kernel/sched: Inititalize "kernel_cs_count" for new tasks
  sched/fair: Track EEVDF stats for entities preempted in kernel mode
  sched/fair: Propagate the min_vruntime of kernel mode preempted entity
  sched/fair: Propagate preempted entity information up cgroup hierarchy
  sched/fair: Allow pick_eevdf() to pick in-kernel entities on throttled
    hierarchy
  sched/fair: Introduce cfs_rq throttled states in preparation for
    partial throttling
  sched/fair: Prepare throttle_cfs_rq() to allow partial throttling
  sched/fair: Prepare unthrottle_cfs_rq() to demote throttle status
  sched/fair: Prepare bandwidth distribution to unthrottle partial
    throttles right away
  sched/fair: Correct the throttle status supplied to pick_eevdf()
  sched/fair: Mark a task if it was picked on a partially throttled
    hierarchy
  sched/fair: Call resched_curr() from sched_notify_syscall_exit()
  sched/fair: Prepare enqueue to partially unthrottle cfs_rq
  sched/fair: Clear pick on throttled indicator when task leave fair
    class
  sched/fair: Prepare pick_next_task_fair() to unthrottle a throttled
    hierarchy
  sched/fair: Ignore in-kernel indicators for running task outside of
    schedule()
  sched/fair: Implement determine_throttle_state() for partial throttle
  [MAYBE BUGFIX] sched/fair: Group all the se->min_* members together
    for propagation
  [DEBUG] sched/fair: Debug pick_eevdf() returning NULL!

 include/linux/entry-common.h |  10 +-
 include/linux/sched.h        |  45 +-
 init/init_task.c             |   5 +-
 kernel/entry/common.c        |  17 +
 kernel/entry/common.h        |   4 +
 kernel/sched/core.c          |   8 +-
 kernel/sched/debug.c         |   2 +-
 kernel/sched/fair.c          | 935 ++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h         |   8 +-
 9 files changed, 950 insertions(+), 84 deletions(-)


base-commit: d34e798094ca7be935b629a42f8b237d4d5b7f1d
-- 
2.43.0
Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
Posted by Peter Zijlstra 10 months ago
On Thu, Feb 20, 2025 at 09:32:35AM +0000, K Prateek Nayak wrote:
> Proposed approach
> =================
> 
> This approach builds on Ben Segall's proposal in [4] which marked the
> task in schedule() when exiting to usermode by setting
> "in_return_to_user" flag except this prototype takes it a step ahead and
> marks a "kernel critical section" within the syscall boundary using a
> per-task "kernel_cs_count".
> 
> The rationale behind this approach is that the task can only hold
> kernel resources when running in kernel mode in preemptible context. In
> this POC, the entire syscall boundary is marked as a kernel critical
> section but in the future, the API can be used to mark fine grained
> boundaries like between an up_read(), down_read() or up_write(),
> down_write() pair.
> 
> Within a kernel critical section, throttling events are deferred until
> the task's "kernel_cs_count" hits 0. Currently this count is an integer
> to catch any cases where the count turns negative as a result of
> oversights on my part but this could be changed to a preempt count like
> mechanism to request a resched.
> 
>             cfs_rq throttled               picked again
> 		  v                              v
> 
>     ----|*********| (preempted by tick / wakeup) |***********| (full throttle)
> 
>         ^                                                    ^
> critical section   cfs_rq is throttled partially     critical section
>       entry           since the task is still              exit
>                   runnable as it was preempted in
>                       kernel critical section
> 
> The EEVDF infrastructure is extended to tracks the avg_vruntime and the
> avg_load of only those entities preempted in kernel mode. When a cfs_rq
> is throttled, it uses these metrics to select among the kernel mode
> preempted tasks and running them till they exit to user mode.
> pick_eevdf() is made aware that it is operating on a throttled hierarchy
> to only select among these tasks that were preempted in kernel mode (and
> the sched entities of cfs_rq that lead to them). When a throttled
> entity's "kernel_cs_count" hits 0, the entire hierarchy is frozen but
> the hierarchy remains accessible for picking until that point.
> 
>           root
>         /     \
>        A       B * (throttled)
>       ...    / | \
>             0  1* 2*
> 
>     (*) Preempted in kernel mode
> 
>   o avg_kcs_vruntime = entity_key(1) * load(1) + entity_key(2) * load(2)
>   o avg_kcs_load = load(1) + load(2)
> 
>   o throttled_vruntime_eligible:
> 
>       entity preempted in kernel mode &&
>       entity_key(<>) * avg_kcs_load <= avg_kcs_vruntime
> 
>   o rbtree is augmented with a "min_kcs_vruntime" field in sched entity
>     that propagates the smallest vruntime of the all the entities in
>     the subtree that are preempted in kernel mode. If they were
>     executing in usermode when preempted, this will be set to LLONG_MAX.
> 
>     This is used to construct a min-heap and select through the
>     entities. Consider rbtree of B to look like:
> 
>          1*
>        /   \
>       2*    0
> 
>       min_kcs_vruntime = (se_in_kernel()) ? se->vruntime : LLONG_MAX:
>       min_kcs_vruntime = min(se->min_kcs_vruntime,
>                              __node_2_se(rb_left)->min_kcs_vruntime,
>                              __node_2_se(rb_right)->min_kcs_vruntime);
> 
>    pick_eevdf() uses the min_kcs_vruntime on the virtual deadline sorted
>    tree to first check the left subtree for eligibility, then the node
>    itself, and then the right subtree.
> 

*groan*... why not actually dequeue the tasks and only retain those with
non-zero cs_count? That avoids having to duplicate everything, no?
Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
Posted by K Prateek Nayak 10 months ago
Hello Peter,

On 2/20/2025 4:25 PM, Peter Zijlstra wrote:
> On Thu, Feb 20, 2025 at 09:32:35AM +0000, K Prateek Nayak wrote:
>> Proposed approach
>> =================
>>
>> This approach builds on Ben Segall's proposal in [4] which marked the
>> task in schedule() when exiting to usermode by setting
>> "in_return_to_user" flag except this prototype takes it a step ahead and
>> marks a "kernel critical section" within the syscall boundary using a
>> per-task "kernel_cs_count".
>>
>> The rationale behind this approach is that the task can only hold
>> kernel resources when running in kernel mode in preemptible context. In
>> this POC, the entire syscall boundary is marked as a kernel critical
>> section but in the future, the API can be used to mark fine grained
>> boundaries like between an up_read(), down_read() or up_write(),
>> down_write() pair.
>>
>> Within a kernel critical section, throttling events are deferred until
>> the task's "kernel_cs_count" hits 0. Currently this count is an integer
>> to catch any cases where the count turns negative as a result of
>> oversights on my part but this could be changed to a preempt count like
>> mechanism to request a resched.
>>
>>              cfs_rq throttled               picked again
>> 		  v                              v
>>
>>      ----|*********| (preempted by tick / wakeup) |***********| (full throttle)
>>
>>          ^                                                    ^
>> critical section   cfs_rq is throttled partially     critical section
>>        entry           since the task is still              exit
>>                    runnable as it was preempted in
>>                        kernel critical section
>>
>> The EEVDF infrastructure is extended to tracks the avg_vruntime and the
>> avg_load of only those entities preempted in kernel mode. When a cfs_rq
>> is throttled, it uses these metrics to select among the kernel mode
>> preempted tasks and running them till they exit to user mode.
>> pick_eevdf() is made aware that it is operating on a throttled hierarchy
>> to only select among these tasks that were preempted in kernel mode (and
>> the sched entities of cfs_rq that lead to them). When a throttled
>> entity's "kernel_cs_count" hits 0, the entire hierarchy is frozen but
>> the hierarchy remains accessible for picking until that point.
>>
>>            root
>>          /     \
>>         A       B * (throttled)
>>        ...    / | \
>>              0  1* 2*
>>
>>      (*) Preempted in kernel mode
>>
>>    o avg_kcs_vruntime = entity_key(1) * load(1) + entity_key(2) * load(2)
>>    o avg_kcs_load = load(1) + load(2)
>>
>>    o throttled_vruntime_eligible:
>>
>>        entity preempted in kernel mode &&
>>        entity_key(<>) * avg_kcs_load <= avg_kcs_vruntime
>>
>>    o rbtree is augmented with a "min_kcs_vruntime" field in sched entity
>>      that propagates the smallest vruntime of the all the entities in
>>      the subtree that are preempted in kernel mode. If they were
>>      executing in usermode when preempted, this will be set to LLONG_MAX.
>>
>>      This is used to construct a min-heap and select through the
>>      entities. Consider rbtree of B to look like:
>>
>>           1*
>>         /   \
>>        2*    0
>>
>>        min_kcs_vruntime = (se_in_kernel()) ? se->vruntime : LLONG_MAX:
>>        min_kcs_vruntime = min(se->min_kcs_vruntime,
>>                               __node_2_se(rb_left)->min_kcs_vruntime,
>>                               __node_2_se(rb_right)->min_kcs_vruntime);
>>
>>     pick_eevdf() uses the min_kcs_vruntime on the virtual deadline sorted
>>     tree to first check the left subtree for eligibility, then the node
>>     itself, and then the right subtree.
>>
> 
> *groan*... why not actually dequeue the tasks and only retain those with
> non-zero cs_count? That avoids having to duplicate everything, no?

The rationale there was with growing number of tasks on cfs_rq, the
throttle path has to perform a lot of dequeues and the unthrottle at
distribution has to enqueue all the dequeued threads back.

This is one way to keep all the tasks queued but allow pick to only
select among those that are preempted in kernel mode.

Since per-task throttling needs to tag, dequeue, and re-enqueue each
task, I'm putting this out as an alternate approach that does not
increase the complexities of tg_tree walks which Ben had noted on
Valentin's series [1]. Instead we retain the per cfs_rq throttling
at the cost of some stats tracking at enqueue and dequeue
boundaries.

If you have a strong feelings against any specific part, or the entirety
of this approach, please do let me know, and I'll do my best to see if
a tweaked approach or an alternate implementation can scale well with
growing thread counts (or at least try to defend the bits in question if
they hold merit still).

Any and all feedback is appreciated :)

[1] https://lore.kernel.org/lkml/xm26y15yz0q8.fsf@google.com/

-- 
Thanks and Regards,
Prateek
Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
Posted by Peter Zijlstra 10 months ago
On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:

> The rationale there was with growing number of tasks on cfs_rq, the
> throttle path has to perform a lot of dequeues and the unthrottle at
> distribution has to enqueue all the dequeued threads back.
> 
> This is one way to keep all the tasks queued but allow pick to only
> select among those that are preempted in kernel mode.
> 
> Since per-task throttling needs to tag, dequeue, and re-enqueue each
> task, I'm putting this out as an alternate approach that does not
> increase the complexities of tg_tree walks which Ben had noted on
> Valentin's series [1]. Instead we retain the per cfs_rq throttling
> at the cost of some stats tracking at enqueue and dequeue
> boundaries.
> 
> If you have a strong feelings against any specific part, or the entirety
> of this approach, please do let me know, and I'll do my best to see if
> a tweaked approach or an alternate implementation can scale well with
> growing thread counts (or at least try to defend the bits in question if
> they hold merit still).
> 
> Any and all feedback is appreciated :)

Pfff.. I hate it all :-)

So the dequeue approach puts the pain on the people actually using the
bandwidth crud, while this 'some extra accounting' crap has *everybody*
pay for this nonsense, right?

I'm not sure how bad this extra accounting is, but I do fear death by a
thousand cuts.
Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
Posted by Valentin Schneider 10 months ago
On 20/02/25 12:32, Peter Zijlstra wrote:
> On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
>
>> The rationale there was with growing number of tasks on cfs_rq, the
>> throttle path has to perform a lot of dequeues and the unthrottle at
>> distribution has to enqueue all the dequeued threads back.
>>
>> This is one way to keep all the tasks queued but allow pick to only
>> select among those that are preempted in kernel mode.
>>
>> Since per-task throttling needs to tag, dequeue, and re-enqueue each
>> task, I'm putting this out as an alternate approach that does not
>> increase the complexities of tg_tree walks which Ben had noted on
>> Valentin's series [1]. Instead we retain the per cfs_rq throttling
>> at the cost of some stats tracking at enqueue and dequeue
>> boundaries.
>>
>> If you have a strong feelings against any specific part, or the entirety
>> of this approach, please do let me know, and I'll do my best to see if
>> a tweaked approach or an alternate implementation can scale well with
>> growing thread counts (or at least try to defend the bits in question if
>> they hold merit still).
>>
>> Any and all feedback is appreciated :)
>
> Pfff.. I hate it all :-)
>
> So the dequeue approach puts the pain on the people actually using the
> bandwidth crud, while this 'some extra accounting' crap has *everybody*
> pay for this nonsense, right?
>
> I'm not sure how bad this extra accounting is, but I do fear death by a
> thousand cuts.

FWIW that was my main worry with the dual tree approach and why I gave up
on it in favor of the per-task dequeue faff. Having the overhead mainly
contained in throttle/unthrottle is a lot more attractive than adding
(arguably small) overhead to the enqueue/dequeue paths. There was also the
headache of figuring out what to do with the .*nr_running fields and what
is reflected to load balance, which isn't an issue with the per-task thing.

As pointed by Ben in [1], the issue with the per-task approach is the
scalability of the unthrottle. You have the rq lock held and you
potentially end up unthrottling a deep cgroup hierarchy, putting each
individual task back on its cfs_rq.

I can't find my notes on that in a hurry, but my idea with that for a next
version was to periodically release the rq lock as we go up the cgroup
hierarchy during unthrottle - the idea being that we can mess with part of
hierarchy, and as long as that part isn't connected to the rest (i.e. it's
not enqueued, like we currently do for CFS throttling), "it should be
safe".

FYI I haven't given up on this, it's just that repeatedly context switching
between IPI deferral and this didn't really work for me so I'm sticking to
one 'till it gets somewhere.

[1]: https://lore.kernel.org/lkml/xm26y15yz0q8.fsf@google.com/
Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
Posted by Josh Don 10 months ago
On Thu, Feb 20, 2025 at 7:40 AM Valentin Schneider <vschneid@redhat.com> wrote:
...
> As pointed by Ben in [1], the issue with the per-task approach is the
> scalability of the unthrottle. You have the rq lock held and you
> potentially end up unthrottling a deep cgroup hierarchy, putting each
> individual task back on its cfs_rq.
>
> I can't find my notes on that in a hurry, but my idea with that for a next
> version was to periodically release the rq lock as we go up the cgroup
> hierarchy during unthrottle - the idea being that we can mess with part of
> hierarchy, and as long as that part isn't connected to the rest (i.e. it's
> not enqueued, like we currently do for CFS throttling), "it should be
> safe".

Can you elaborate a bit more? Even if we periodically release the
lock, we're still spending quite a long time in non-preemptible kernel
context, and unthrottle is also driven by an hrtimer. So we can still
do quite a lot of damage depending on how long the whole loop takes.
Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
Posted by Valentin Schneider 9 months, 3 weeks ago
On 20/02/25 17:47, Josh Don wrote:
> On Thu, Feb 20, 2025 at 7:40 AM Valentin Schneider <vschneid@redhat.com> wrote:
> ...
>> As pointed by Ben in [1], the issue with the per-task approach is the
>> scalability of the unthrottle. You have the rq lock held and you
>> potentially end up unthrottling a deep cgroup hierarchy, putting each
>> individual task back on its cfs_rq.
>>
>> I can't find my notes on that in a hurry, but my idea with that for a next
>> version was to periodically release the rq lock as we go up the cgroup
>> hierarchy during unthrottle - the idea being that we can mess with part of
>> hierarchy, and as long as that part isn't connected to the rest (i.e. it's
>> not enqueued, like we currently do for CFS throttling), "it should be
>> safe".
>
> Can you elaborate a bit more? Even if we periodically release the
> lock, we're still spending quite a long time in non-preemptible kernel
> context, and unthrottle is also driven by an hrtimer. So we can still
> do quite a lot of damage depending on how long the whole loop takes.

Indeed, this only gives the rq lock a breather, but it doesn't help with
preempt / irq off.
Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
Posted by K Prateek Nayak 10 months ago
Hello Valentin,

On 2/20/2025 9:10 PM, Valentin Schneider wrote:
> On 20/02/25 12:32, Peter Zijlstra wrote:
>> On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
>>
>>> The rationale there was with growing number of tasks on cfs_rq, the
>>> throttle path has to perform a lot of dequeues and the unthrottle at
>>> distribution has to enqueue all the dequeued threads back.
>>>
>>> This is one way to keep all the tasks queued but allow pick to only
>>> select among those that are preempted in kernel mode.
>>>
>>> Since per-task throttling needs to tag, dequeue, and re-enqueue each
>>> task, I'm putting this out as an alternate approach that does not
>>> increase the complexities of tg_tree walks which Ben had noted on
>>> Valentin's series [1]. Instead we retain the per cfs_rq throttling
>>> at the cost of some stats tracking at enqueue and dequeue
>>> boundaries.
>>>
>>> If you have a strong feelings against any specific part, or the entirety
>>> of this approach, please do let me know, and I'll do my best to see if
>>> a tweaked approach or an alternate implementation can scale well with
>>> growing thread counts (or at least try to defend the bits in question if
>>> they hold merit still).
>>>
>>> Any and all feedback is appreciated :)
>>
>> Pfff.. I hate it all :-)
>>
>> So the dequeue approach puts the pain on the people actually using the
>> bandwidth crud, while this 'some extra accounting' crap has *everybody*
>> pay for this nonsense, right?
>>
>> I'm not sure how bad this extra accounting is, but I do fear death by a
>> thousand cuts.
> 
> FWIW that was my main worry with the dual tree approach and why I gave up
> on it in favor of the per-task dequeue faff. Having the overhead mainly
> contained in throttle/unthrottle is a lot more attractive than adding
> (arguably small) overhead to the enqueue/dequeue paths. There was also the
> headache of figuring out what to do with the .*nr_running fields and what
> is reflected to load balance, which isn't an issue with the per-task thing.

I believed that with the differentiation of nr_queued and nr_runnable
now, the counts would be simpler to correct (I might be wrong).

This approach retains the single rbtree but yes there is a cost
associated with maintaining these stats. The stats collection can be
deferred until a bandwidth constraint is first enforced but yes the
small cost remains in every enqueue, dequeue, put_prev_entity,
set_next_entity path thereafter.

Arguably, this should be no costlier than the current tracking of
h_nr_delayed + min_slice in the enqueue and dequeue paths, but I might
be wrong.

> 
> As pointed by Ben in [1], the issue with the per-task approach is the
> scalability of the unthrottle. You have the rq lock held and you
> potentially end up unthrottling a deep cgroup hierarchy, putting each
> individual task back on its cfs_rq.

Agreed which is why this alternate approach to retain the throttling and
unthrottling at cfs_rq level was worth a try.

> 
> I can't find my notes on that in a hurry, but my idea with that for a next
> version was to periodically release the rq lock as we go up the cgroup
> hierarchy during unthrottle - the idea being that we can mess with part of
> hierarchy, and as long as that part isn't connected to the rest (i.e. it's
> not enqueued, like we currently do for CFS throttling), "it should be
> safe".

That is pretty nifty! My only concern there would be the case where a
part of the throttled hierarchy is still reachable on unthrottle but
another part has dequeued itself - some tasks might have to wait until
that part is queued again to be accessible during the pick, and a bunch
of rescheds would follow with each batch of enqueues.

> 
> FYI I haven't given up on this, it's just that repeatedly context switching
> between IPI deferral and this didn't really work for me so I'm sticking to
> one 'till it gets somewhere.

Ack! This RFC was to get feedback from folks and to see if there are
any takers for cfs_rq level throttling and the reasons to move to a
per-task throttling. Safe to say I'm slowly getting some answers :)

> 
> [1]: https://lore.kernel.org/lkml/xm26y15yz0q8.fsf@google.com/
> 

-- 
Thanks and Regards,
Prateek
Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
Posted by K Prateek Nayak 10 months ago
Hello Peter,

On 2/20/2025 5:02 PM, Peter Zijlstra wrote:
> On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
> 
>> The rationale there was with growing number of tasks on cfs_rq, the
>> throttle path has to perform a lot of dequeues and the unthrottle at
>> distribution has to enqueue all the dequeued threads back.
>>
>> This is one way to keep all the tasks queued but allow pick to only
>> select among those that are preempted in kernel mode.
>>
>> Since per-task throttling needs to tag, dequeue, and re-enqueue each
>> task, I'm putting this out as an alternate approach that does not
>> increase the complexities of tg_tree walks which Ben had noted on
>> Valentin's series [1]. Instead we retain the per cfs_rq throttling
>> at the cost of some stats tracking at enqueue and dequeue
>> boundaries.
>>
>> If you have a strong feelings against any specific part, or the entirety
>> of this approach, please do let me know, and I'll do my best to see if
>> a tweaked approach or an alternate implementation can scale well with
>> growing thread counts (or at least try to defend the bits in question if
>> they hold merit still).
>>
>> Any and all feedback is appreciated :)
> 
> Pfff.. I hate it all :-)
> 
> So the dequeue approach puts the pain on the people actually using the
> bandwidth crud,

Back in Josh Don's presentation at the "Humongous Servers vs Kernel
Scalability" BoF [1] at LPC'24, they mentioned one server handles
around "O(250k) threads" (Slide 21)

Assuming 256 logical CPUs from their first couple of slides, that is
about 1K potential tasks that can be throttled in one go on each CPU.
Doing that within a single rq_lock critical section may take quite a
bit of time.

Is the expectation that these deployments have to be managed more
smartly if we move to a per-task throttling model? Else it is just
hard lockup by a thousand tasks.

If Ben or Josh can comment on any scalability issues they might have
seen on their deployment and any learning they have drawn from them
since LPC'24, it would be great. Any stats on number of tasks that
get throttled at one go would also be helpful.

[1] https://lpc.events/event/18/contributions/1855/attachments/1436/3432/LPC%202024_%20Scalability%20BoF.pdf

> while this 'some extra accounting' crap has *everybody*
> pay for this nonsense, right?

That is correct. Let me go and get some numbers to see if the overhead
is visible but with deeper hierarchies there is a ton that goes on
already which may hide these overheads. I'll try with different levels
on a wakeup heavy task.

> 
> I'm not sure how bad this extra accounting is, but I do fear death by a
> thousand cuts.

We surely don't want that!

-- 
Thanks and Regards,
Prateek
Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
Posted by Josh Don 10 months ago
On Thu, Feb 20, 2025 at 4:04 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Peter,
>
> On 2/20/2025 5:02 PM, Peter Zijlstra wrote:
> > On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
> >> Any and all feedback is appreciated :)
> >
> > Pfff.. I hate it all :-)
> >
> > So the dequeue approach puts the pain on the people actually using the
> > bandwidth crud, while this 'some extra accounting' crap has *everybody*
> > pay for this nonsense, right?

Doing the context tracking could also provide benefit beyond CFS
bandwidth. As an example, we often see a pattern where a thread
acquires one mutex, then sleeps on trying to take a second mutex. When
the thread eventually is woken due to the second mutex now being
available, the thread now needs to wait to get back on cpu, which can
take an arbitrary amount of time depending on where it landed in the
tree, its weight, etc. Other threads trying to acquire that first
mutex now experience priority inversion as they must wait for the
original thread to get back on cpu and release the mutex. Re-using the
same context tracking, we could prioritize execution of threads in
kernel critical sections, even if they aren't the fair next choice.

If that isn't convincing enough, we could certainly throw another
kconfig or boot param for this behavior :)

> Is the expectation that these deployments have to be managed more
> smartly if we move to a per-task throttling model? Else it is just
> hard lockup by a thousand tasks.

+1, I don't see the per-task throttling being able to scale here.

> If Ben or Josh can comment on any scalability issues they might have
> seen on their deployment and any learning they have drawn from them
> since LPC'24, it would be great. Any stats on number of tasks that
> get throttled at one go would also be helpful.

Maybe just to emphasize that we continue to see the same type of
slowness; throttle/unthrottle when traversing a large cgroup
sub-hierarchy is still an issue for us and we're working on sending a
patch to ideally break this up to do the updates more lazily, as
described at LPC.

In particular, throttle/unthrottle (whether it be on a group basis or
a per-task basis) is a loop that is subject to a lot of cache misses.

Best,
Josh
Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
Posted by K Prateek Nayak 10 months ago
Hello Josh,

Thank you for sharing the background!

On 2/21/2025 7:34 AM, Josh Don wrote:
> On Thu, Feb 20, 2025 at 4:04 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> Hello Peter,
>>
>> On 2/20/2025 5:02 PM, Peter Zijlstra wrote:
>>> On Thu, Feb 20, 2025 at 04:48:58PM +0530, K Prateek Nayak wrote:
>>>> Any and all feedback is appreciated :)
>>>
>>> Pfff.. I hate it all :-)
>>>
>>> So the dequeue approach puts the pain on the people actually using the
>>> bandwidth crud, while this 'some extra accounting' crap has *everybody*
>>> pay for this nonsense, right?
> 
> Doing the context tracking could also provide benefit beyond CFS
> bandwidth. As an example, we often see a pattern where a thread
> acquires one mutex, then sleeps on trying to take a second mutex. When
> the thread eventually is woken due to the second mutex now being
> available, the thread now needs to wait to get back on cpu, which can
> take an arbitrary amount of time depending on where it landed in the
> tree, its weight, etc. Other threads trying to acquire that first
> mutex now experience priority inversion as they must wait for the
> original thread to get back on cpu and release the mutex. Re-using the
> same context tracking, we could prioritize execution of threads in
> kernel critical sections, even if they aren't the fair next choice.

Just out of curiosity, have you tried running with proxy-execution [1][2]
on your deployments to mitigate priority inversion in mutexes? I've
tested it with smaller scale benchmarks and I haven't seen much overhead
except in the case of a few microbenchmarks, but I'm not sure if you've
run into any issues at your scale.

[1] https://lore.kernel.org/lkml/20241125195204.2374458-1-jstultz@google.com/
[2] https://github.com/johnstultz-work/linux-dev/commits/proxy-exec-v14-6.13-rc1/

> 
> If that isn't convincing enough, we could certainly throw another
> kconfig or boot param for this behavior :)
> 
>> Is the expectation that these deployments have to be managed more
>> smartly if we move to a per-task throttling model? Else it is just
>> hard lockup by a thousand tasks.
> 
> +1, I don't see the per-task throttling being able to scale here.
> 
>> If Ben or Josh can comment on any scalability issues they might have
>> seen on their deployment and any learning they have drawn from them
>> since LPC'24, it would be great. Any stats on number of tasks that
>> get throttled at one go would also be helpful.
> 
> Maybe just to emphasize that we continue to see the same type of
> slowness; throttle/unthrottle when traversing a large cgroup
> sub-hierarchy is still an issue for us and we're working on sending a
> patch to ideally break this up to do the updates more lazily, as
> described at LPC.

Is it possible to share an example hierarchy from one of your
deployments? Your presentation for LPC'24 [3] says "O(1000) cgroups",
but is it possible to reveal the kind of nesting you deal with and at
which levels the bandwidth controls are set? Even something like "O(10)
cgroups on root with BW throttling set, each of them containing O(100)
cgroups below" could also help match a test setup.

[3] https://lpc.events/event/18/contributions/1855/attachments/1436/3432/LPC%202024_%20Scalability%20BoF.pdf

> 
> In particular, throttle/unthrottle (whether it be on a group basis or
> a per-task basis) is a loop that is subject to a lot of cache misses.
> 
> Best,
> Josh

-- 
Thanks and Regards,
Prateek

Re: [RFC PATCH 00/22] sched/fair: Defer CFS throttling to exit to user mode
Posted by Josh Don 9 months, 4 weeks ago
On Thu, Feb 20, 2025 at 7:38 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
...
> Just out of curiosity, have you tried running with proxy-execution [1][2]
> on your deployments to mitigate priority inversion in mutexes? I've
> tested it with smaller scale benchmarks and I haven't seem much overhead
> except for in case of a few microbenchmarks but I'm not sure if you've
> run into any issues at your scale.

The confounding issue is that we see tail issues with other types of
primitives, such as semaphores. That led us to trying an approach
similar to yours with treating kernel-mode as a critical section from
the perspective of e.g. CFSB.

> Is it possible to share an example hierarchy from one of your
> deployments? Your presentation for LPC'24 [1] says "O(1000) cgroups" but
> is it possible to reveal the kind of nesting you deal with and at which
> levels are bandwidth controls set. Even something like "O(10) cgroups on
> root with BW throttling set, and each of them contain O(100) cgroups
> below" could also help match a test setup.

Sure, I can help shed some additional light. In terms of cgroup depth,
we try to keep that fairly limited, given the cgroup depth scaling
issues with task enqueue/dequeue. Max depth is maybe around ~5
depending on the exact job configuration, with an average closer to
2-3. However, width is quite large as we have many large dual socket
machines that can handle hundreds of individual jobs (as I called out
in the presentation, larger cpu count leads to more cgroups on the
machine in order to fully utilize resources). The example I referred
to in the presentation looks something like:

root -> subtree_parent (this cgroup has CFSB enabled, period = 100ms)
-> (~300-400 direct children, with some fraction having additional
child cgroups, bringing total to O(1000))

Best,
Josh