sched: cpu parked and push current task mechanism

[RFC PATCH 0/5] sched: cpu parked and push current task mechanism

Posted by Shrikanth Hegde 8 months, 2 weeks ago

In a para-virtualised environment, there could be multiple
overcommitted VMs. i.e sum of virtual CPUs(vCPU) > physical CPU(pCPU).
When all such VMs request for cpu cycles at the same, it is not possible
to serve all of them. This leads to VM level preemptions and hence the
steal time.

Bring the notion of CPU parked state which implies underlying pCPU may
not be available for use at this time. This means it is better to avoid
this vCPU. So when a CPU is marked as parked, one should vacate it as
soon as it can. So it is going to dynamic at runtime and can change
often.

In general, task level preemption(driven by VM) is less expensive than VM
level preemption(driven by hypervisor). So pack to less CPUs helps to
improve the overall workload throughput/latency.

Architecture needs to decide which CPUs are parked. Currently we are
exploring getting the hint from the stolen time and hypervisor provided
statistics. There is simple powerpc debug patch which shows how one can
make use of it cpu parked feature.

cpu parking and need for cpu parking has been explained here as well [1]. Much
of the context explained in the cover letter there applies to this
problem context as well.
[1]: https://lore.kernel.org/all/20250512115325.30022-1-huschle@linux.ibm.com/

While trying the above method, on large system (480 vCPUS) it was taking
around 8-10 seconds for workload to move. Which is a longer time,
so this approach, where workload moves within 1-2 seconds

Pros:
- Once tasks move, no load balancer overheads
- Less need for stats. minimal load balancer changes.
- Faster. Since it is based on sched_tick
- system maintains a state of parked cpus. Other subsystems may find it
useful.

Cons:
- stop machine based to move the current task. So couldn't move it
before it gets scheduled.
- Depends on CONFIG_HOTPLUG_CPU since it is relying on __balance_push_cpu_stop
(might not be a big concern)

Sending this out to get feedback on the idea. This mechanism
seems lightweight and fast. There are other push task related patches
sent for EAS[2], and newidle balance[3]. Maybe it is time to come up push task
framework that each one can make use of. Need to dig more into it[4].
Need to address RT, DL, IRQ, taskset concerns still. There maybe
subtle races too(no warn/bugs on console while testing cfs tasks)

[2]: https://lore.kernel.org/all/20250302210539.1563190-1-vincent.guittot@linaro.org/
[3]: https://lore.kernel.org/lkml/20250409111539.23791-1-kprateek.nayak@amd.com/
[4]: https://lore.kernel.org/all/xhsmh1putoxbz.mognet@vschneid-thinkpadt14sgen2i.remote.csb/

Based on tip/master at fa95dea97bd1 (Merge branch into tip/master: 'perf/core')

Shrikanth Hegde (5):
cpumask: Introduce cpu parked mask
sched/core: Don't use parked cpu for selection
sched/fair: Don't use parked cpu for load balancing
sched/core: Push current task when cpu is parked
powerpc: Use manual hint for cpu parking

--
2.39.3