tl;dr
This is a follow-up to [1] with a few fixes and review comments addressed.
Upgraded from RFC to RFC PATCH.
Please review.
[1]: v2 - https://lore.kernel.org/all/20250625191108.1646208-1-sshegde@linux.ibm.com/
v2 -> v3:
- Renamed to paravirt CPUs.
- Folded the changes under CONFIG_PARAVIRT.
- Fixed a crash caused by work_buf corruption when using
  stop_one_cpu_nowait.
- Added sysfs documentation.
- Copied most of __balance_push_cpu_stop into a new function; this
  helps move the code out of CONFIG_HOTPLUG_CPU.
- Applied some of the suggested code movement.
-----------------
::Detailed info::
-----------------
Problem statement
vCPU - Virtual CPU - a CPU in the VM world.
pCPU - Physical CPU - a CPU in the baremetal world.
A hypervisor schedules vCPUs on pCPUs. It has to give each vCPU some
cycles and be fair. When there are more vCPU requests than pCPUs, the
hypervisor has to preempt some vCPUs in order to run others. This is
called vCPU preemption.
Take two VMs: when the hypervisor preempts a vCPU of VM1 to run a vCPU
of VM2, it has to save/restore VM context. Instead, if VMs can
coordinate among themselves and request only a limited number of vCPUs,
that overhead is avoided and the context switching happens within a
vCPU (less expensive). Even when the hypervisor preempts one vCPU to
run another within the same VM, it is still more expensive than task
preemption within a vCPU. So the basic aim is to avoid vCPU preemption.
To achieve this, introduce the "Paravirt CPU" concept: vCPUs that the
workload is better off avoiding at the moment. (The vCPUs stay online;
we don't want the overhead of a sched domain rebuild, and hotplug takes
a lot of time too.)
When there is contention, don't use paravirt CPUs.
When there is no contention, use all vCPUs.
----------------------------------
Implementation details and choices
- The current version copies most of the code in
  __balance_push_cpu_stop. This was done to avoid the
  CONFIG_HOTPLUG_CPU dependency and move the code under
  CONFIG_PARAVIRT. It also allows fixing the race in
  stop_one_cpu_nowait; hacks would be needed in
  __balance_push_cpu_stop otherwise.
- Explored using task_work_add instead of stop_one_cpu_nowait, similar
  to what mm_cid does. It sometimes ended up locking up the system, and
  it takes slightly longer to move tasks compared to
  stop_one_cpu_nowait.
- Tried using push_cpu_stop instead of adding more code. Made it work
  for CFS by adding find_lock_rq, but RT tasks fail to move out of
  paravirt CPUs completely; around 5-10% utilization is left. Maybe it
  races with the RT push/pull logic, since they all use push_busy for
  gating.
- Kept the helper patch where one can specify a cpulist to set the
  paravirt CPUs. It helped uncover some corner cases, such as CPUs
  0-100 being marked as paravirt; the number-based debug file couldn't
  express that. The nature of the hint could change, so both flavours
  are kept for now and will be changed depending on how the hint design
  goes.
---------------------
bloat-o-meter reports
- CONFIG_PARAVIRT=y
add/remove: 12/0 grow/shrink: 14/0 up/down: 1767/0 (1767)
Function old new delta
paravirt_push_cpu_stop - 479 +479
push_current_from_paravirt_cpu - 410 +410
store_paravirt_cpus - 174 +174
...
Total: Before=25132435, After=25134202, chg +0.01%
Values depend on NR_CPUS. The data above is for NR_CPUS=64 on x86.
On PowerPC with NR_CPUS=8192:
add/remove: 18/3 grow/shrink: 26/12 up/down: 5320/-484 (4836)
Function old new delta
__cpu_paravirt_mask - 1024 +1024
paravirt_push_cpu_stop - 864 +864
push_current_from_paravirt_cpu - 648 +648
...
Total: Before=30273517, After=30278353, chg +0.02%
- CONFIG_PARAVIRT=n
add/remove: 0/0 grow/shrink: 2/1 up/down: 35/-32 (3)
Function old new delta
select_task_rq_fair 4376 4395 +19
check_preempt_wakeup_fair 895 911 +16
set_next_entity 659 627 -32
Total: Before=25106525, After=25106528, chg +0.00%
------------------------------
Functional and Performance data
- Tasks move out of paravirt CPUs quite fast. Even when the system is
  heavily loaded, it takes at most 1-2 seconds for tasks to move out of
  all paravirt CPUs.
- schbench results. Experiments were done on a system with 94 physical
  cores, running two Shared Processor LPARs (VMs). LPAR1 has 90 cores
  (entitled 60) and LPAR2 has 64 cores (entitled 32). Entitled here
  means the LPAR should get at least that many cores' worth of cycles.
  When both LPARs ran at high utilization at the same time, there was
  contention and high steal time was seen. When there was contention,
  the non-entitled cores were marked as paravirt CPUs. In another
  experiment the non-entitled CPUs were hotplugged out. Both sets of
  data below show the advantage of using paravirt CPUs instead.
LPAR1 is running schbench and LPAR2 is running stress-ng intermittently,
i.e. busy/idle (stress-ng runs for 60 sec and is then idle for 60 sec).
Wakeup Latencies Out of Box cpu_hotplug cpu_paravirt
50.0th: 15 15 14
90.0th: 70 25 19
99.0th: 3084 345 95
99.9th: 6184 3004 523
When the busy/idle duration is reduced to close to 10 seconds on LPAR2,
the benefit of cpu_paravirt shrinks. cpu_hotplug won't work at all in
those cases, since the hotplug operation itself takes close to 20+
seconds. The benefit of cpu_paravirt over out of box shows up when the
busy/idle duration is greater than 10 seconds. When the concurrency of
the system is lowered, a benefit is seen even at 10 seconds. So using
paravirt CPUs will likely help workloads which are sensitive to
latency.
------------
Open issues:
- Deriving the hint from steal time is still a challenge. Some work is
  underway to address it.
- Consider KVM and other hypervisors and how they could derive the
  hint. Inputs from the community are needed.
- Make irqbalance understand cpu_paravirt_mask.
- Works somewhat on nohz_full CPUs, but tasks don't completely move out
  of a few CPUs. It is unclear whether this can work at all there,
  since the tick is usually disabled. Needs further
  understanding/investigation.
Shrikanth Hegde (10):
sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
cpumask: Introduce cpu_paravirt_mask
sched: Static key to check paravirt cpu push
sched/core: Dont allow to use CPU marked as paravirt
sched/fair: Don't consider paravirt CPUs for wakeup and load balance
sched/rt: Don't select paravirt CPU for wakeup and push/pull rt task
sched/core: Push current task from paravirt CPU
sysfs: Add cpu paravirt file
powerpc: Add debug file for set/unset paravirt CPUs
sysfs: Provide write method for paravirt
.../ABI/testing/sysfs-devices-system-cpu | 9 ++
Documentation/scheduler/sched-arch.rst | 37 +++++++
arch/powerpc/include/asm/paravirt.h | 1 +
arch/powerpc/kernel/smp.c | 58 ++++++++++
drivers/base/base.h | 4 +
drivers/base/cpu.c | 53 +++++++++
include/linux/cpumask.h | 15 +++
kernel/sched/core.c | 103 +++++++++++++++++-
kernel/sched/fair.c | 15 ++-
kernel/sched/rt.c | 11 +-
kernel/sched/sched.h | 26 ++++-
11 files changed, 325 insertions(+), 7 deletions(-)
--
2.47.3