In a para-virtualised environment there can be multiple overcommitted
VMs, i.e. the sum of virtual CPUs (vCPUs) exceeds the number of physical
CPUs (pCPUs). When all such VMs request CPU cycles at the same time, it
is not possible to serve all of them. This leads to VM-level preemptions
and hence steal time.

Introduce the notion of a CPU "parked" state, which implies that the
underlying pCPU may not be available for use at this time and the vCPU
is better avoided. When a CPU is marked as parked, tasks should vacate
it as soon as possible. The parked state is dynamic at runtime and can
change often.

In general, task-level preemption (driven by the VM) is less expensive
than VM-level preemption (driven by the hypervisor), so packing work
onto fewer CPUs helps to improve overall workload throughput/latency.

The architecture decides which CPUs are parked. Currently we are
exploring deriving the hint from steal time and hypervisor-provided
statistics. A simple powerpc debug patch shows how the cpu parked
feature can be used.

CPU parking and the need for it have also been explained in [1]; much of
the context in that cover letter applies to this problem as well.

[1]: https://lore.kernel.org/all/20250512115325.30022-1-huschle@linux.ibm.com/

While trying the above method on a large system (480 vCPUs), it took
around 8-10 seconds for the workload to move, which is quite long. With
the approach in this series the workload moves within 1-2 seconds.

Pros:
- Once tasks move, there is no load balancer overhead.
- Less need for statistics; minimal load balancer changes.
- Faster, since it is based on the sched_tick.
- The system maintains the state of parked CPUs; other subsystems may
  find it useful.

Cons:
- Moving the current task is stop-machine based, so it cannot be moved
  before it gets scheduled.
- Depends on CONFIG_HOTPLUG_CPU since it relies on
  __balance_push_cpu_stop (might not be a big concern).

Sending this out to get feedback on the idea. This mechanism seems
lightweight and fast. There are other push-task related patches sent for
EAS [2] and newidle balance [3]. Maybe it is time to come up with a
push-task framework that each of them can make use of; need to dig more
into it [4].

RT, DL, IRQ and taskset concerns still need to be addressed. There may
be subtle races too (no warnings/bugs on the console while testing CFS
tasks).

[2]: https://lore.kernel.org/all/20250302210539.1563190-1-vincent.guittot@linaro.org/
[3]: https://lore.kernel.org/lkml/20250409111539.23791-1-kprateek.nayak@amd.com/
[4]: https://lore.kernel.org/all/xhsmh1putoxbz.mognet@vschneid-thinkpadt14sgen2i.remote.csb/

Based on tip/master at fa95dea97bd1 (Merge branch into tip/master:
'perf/core')

Shrikanth Hegde (5):
  cpumask: Introduce cpu parked mask
  sched/core: Don't use parked cpu for selection
  sched/fair: Don't use parked cpu for load balancing
  sched/core: Push current task when cpu is parked
  powerpc: Use manual hint for cpu parking

 arch/powerpc/kernel/smp.c | 45 +++++++++++++++++++++++++++++++++++++++
 include/linux/cpumask.h   | 14 ++++++++++++
 kernel/cpu.c              |  3 +++
 kernel/sched/core.c       | 43 +++++++++++++++++++++++++++++++++++--
 kernel/sched/fair.c       |  1 +
 kernel/sched/sched.h      |  1 +
 6 files changed, 105 insertions(+), 2 deletions(-)

--
2.39.3
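As a rough illustration of what the "parked" state could look like at
the cpumask level, the sketch below follows the existing cpu_*_mask
conventions in include/linux/cpumask.h. The names __cpu_parked_mask,
cpu_parked() and set_cpu_parked() are assumptions made for this sketch
and may not match the actual patches.

#include <linux/cpumask.h>

/*
 * Illustrative sketch only, modelled on __cpu_online_mask and friends;
 * the real series may use different names and placement.
 */
extern struct cpumask __cpu_parked_mask;
#define cpu_parked_mask ((const struct cpumask *)&__cpu_parked_mask)

/* True if the architecture currently advises avoiding this CPU. */
static inline bool cpu_parked(unsigned int cpu)
{
	return cpumask_test_cpu(cpu, cpu_parked_mask);
}

/*
 * Called by the architecture when its heuristic (steal time, hypervisor
 * statistics, ...) decides a vCPU should be avoided or used again.
 */
static inline void set_cpu_parked(unsigned int cpu, bool parked)
{
	if (parked)
		cpumask_set_cpu(cpu, &__cpu_parked_mask);
	else
		cpumask_clear_cpu(cpu, &__cpu_parked_mask);
}

The scheduler side would then treat set bits simply as CPUs to vacate
and to skip during task placement and load balancing.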
On Fri, May 23, 2025 at 11:44:43PM +0530, Shrikanth Hegde wrote:
> In a para-virtualised environment, there could be multiple
> overcommitted VMs. i.e sum of virtual CPUs(vCPU) > physical CPU(pCPU).
> When all such VMs request for cpu cycles at the same, it is not possible
> to serve all of them. This leads to VM level preemptions and hence the
> steal time.
>
> Bring the notion of CPU parked state which implies underlying pCPU may
> not be available for use at this time. This means it is better to avoid
> this vCPU. So when a CPU is marked as parked, one should vacate it as
> soon as it can. So it is going to dynamic at runtime and can change
> often.

You've lost me here already. Why would pCPU not be available? Simply
because it is running another vCPU? I would say this means the pCPU is
available, it's just doing something else.

Not available to me means it is going offline or something like that.

> In general, task level preemption(driven by VM) is less expensive than VM
> level preemption(driven by hypervisor). So pack to less CPUs helps to
> improve the overall workload throughput/latency.

This seems to suggest you're 'parking' vCPUs, while above you seemed to
suggest pCPU. More confusion.

> cpu parking and need for cpu parking has been explained here as well [1]. Much
> of the context explained in the cover letter there applies to this
> problem context as well.
> [1]: https://lore.kernel.org/all/20250512115325.30022-1-huschle@linux.ibm.com/

Yeah, totally not following any of that either :/

Mostly I have only confusion and no idea what you're actually wanting
to do.
On Tue, May 27, 2025 at 05:10:20PM +0200, Peter Zijlstra wrote:
> On Fri, May 23, 2025 at 11:44:43PM +0530, Shrikanth Hegde wrote:
> > In a para-virtualised environment, there could be multiple
> > overcommitted VMs. i.e sum of virtual CPUs(vCPU) > physical CPU(pCPU).
> > When all such VMs request for cpu cycles at the same, it is not possible
> > to serve all of them. This leads to VM level preemptions and hence the
> > steal time.
> >
> > Bring the notion of CPU parked state which implies underlying pCPU may
> > not be available for use at this time. This means it is better to avoid
> > this vCPU. So when a CPU is marked as parked, one should vacate it as
> > soon as it can. So it is going to dynamic at runtime and can change
> > often.
>
> You've lost me here already. Why would pCPU not be available? Simply
> because it is running another vCPU? I would say this means the pCPU is
> available, its just doing something else.
>
> Not available to me means it is going offline or something like that.
>
> > In general, task level preemption(driven by VM) is less expensive than VM
> > level preemption(driven by hypervisor). So pack to less CPUs helps to
> > improve the overall workload throughput/latency.
>
> This seems to suggest you're 'parking' vCPUs, while above you seemed to
> suggest pCPU. More confusion.
>
> > cpu parking and need for cpu parking has been explained here as well [1]. Much
> > of the context explained in the cover letter there applies to this
> > problem context as well.
> > [1]: https://lore.kernel.org/all/20250512115325.30022-1-huschle@linux.ibm.com/
>
> Yeah, totally not following any of that either :/
>
> Mostly I have only confusion and no idea what you're actually wanting to
> do.

My wild guess is that the idea is to not preempt the pCPU while running
a particular vCPU workload. But I agree, this should all be reworded and
explained better. I didn't understand this, either.

Thanks,
Yury
Hi Peter, Yury.

Thanks for taking a look at this series.

On 5/27/25 21:17, Yury Norov wrote:
> On Tue, May 27, 2025 at 05:10:20PM +0200, Peter Zijlstra wrote:
>> On Fri, May 23, 2025 at 11:44:43PM +0530, Shrikanth Hegde wrote:
>>> In a para-virtualised environment, there could be multiple
>>> overcommitted VMs. i.e sum of virtual CPUs(vCPU) > physical CPU(pCPU).
>>> When all such VMs request for cpu cycles at the same, it is not possible
>>> to serve all of them. This leads to VM level preemptions and hence the
>>> steal time.
>>>
>>> Bring the notion of CPU parked state which implies underlying pCPU may
>>> not be available for use at this time. This means it is better to avoid
>>> this vCPU. So when a CPU is marked as parked, one should vacate it as
>>> soon as it can. So it is going to dynamic at runtime and can change
>>> often.
>>
>> You've lost me here already. Why would pCPU not be available? Simply
>> because it is running another vCPU? I would say this means the pCPU is
>> available, its just doing something else.
>>
>> Not available to me means it is going offline or something like that.
>>
>>> In general, task level preemption(driven by VM) is less expensive than VM
>>> level preemption(driven by hypervisor). So pack to less CPUs helps to
>>> improve the overall workload throughput/latency.
>>
>> This seems to suggest you're 'parking' vCPUs, while above you seemed to
>> suggest pCPU. More confusion.

Yes, I meant parking of vCPUs only. A pCPU is running one of those vCPUs
at any point in time.

>>
>>> cpu parking and need for cpu parking has been explained here as well [1]. Much
>>> of the context explained in the cover letter there applies to this
>>> problem context as well.
>>> [1]: https://lore.kernel.org/all/20250512115325.30022-1-huschle@linux.ibm.com/
>>
>> Yeah, totally not following any of that either :/
>>
>> Mostly I have only confusion and no idea what you're actually wanting to
>> do.
>
> My wild guess is that the idea is to not preempt the pCPU while running
> a particular vCPU workload. But I agree, this should all be reworded and
> explained better. I didn't understand this, either.
>
> Thanks,
> YUry

Apologies for not explaining it clearly. Let me take another shot at it:

----------------------------

vCPU - virtual CPU - a CPU in the VM world.
pCPU - physical CPU - a CPU in the bare-metal world.

A hypervisor manages the vCPUs of the different VMs. When a vCPU
requests CPU time, the hypervisor does the job of scheduling it on a
pCPU.

The issue occurs when there are more vCPUs (combined across all VMs)
than pCPUs. When *all* vCPUs are requesting CPU time, the hypervisor can
only run a few of them and the remaining ones are preempted (waiting for
a pCPU).

Take two VMs: when the hypervisor preempts a vCPU of VM1 to run a vCPU
of VM2, it has to save/restore the VM context. If instead the VMs can
coordinate among each other and request a *limited* number of vCPUs,
that overhead is avoided and the context switching happens within the
vCPU (less expensive). Even when the hypervisor preempts one vCPU to run
another within the same VM, it is still more expensive than task
preemption within the vCPU. So the *basic* aim is to avoid vCPU
preemption.

To achieve this, we use the parking concept (we need a better name for
sure), where it is better if workloads avoid some vCPUs at this moment.
(The vCPUs stay online; we don't want the overhead of a sched domain
rebuild.)

Contention is dynamic in nature. Whether there is contention for pCPUs
is to be detected and determined by the architecture, and the
architecture needs to update the mask regularly.

When there is contention, use the limited set of vCPUs indicated by the
arch. When there is no contention, use all vCPUs.
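To make the intended flow a bit more concrete, here is a rough sketch of
how an architecture could refresh such a mask and how the scheduler side
could consume it. arch_nr_usable_vcpus(), set_cpu_parked() and
cpu_parked_mask are made-up names for this illustration, not the exact
interfaces in the patches.

#include <linux/cpumask.h>

/*
 * Illustration only. arch_nr_usable_vcpus() stands in for whatever hint
 * the architecture derives from steal time / hypervisor statistics.
 */
unsigned int arch_nr_usable_vcpus(void);
void set_cpu_parked(unsigned int cpu, bool parked);
extern const struct cpumask *cpu_parked_mask;	/* hypothetical mask */

/* Arch side: refresh periodically, e.g. from a timer or the tick. */
static void arch_refresh_parked_cpus(void)
{
	unsigned int usable = arch_nr_usable_vcpus();
	unsigned int cpu;

	/* Park the highest-numbered vCPUs; leave the first 'usable' alone. */
	for_each_online_cpu(cpu)
		set_cpu_parked(cpu, cpu >= usable);
}

/*
 * Scheduler side: conceptually, drop parked CPUs from any candidate mask
 * used for wakeup placement or load balancing.
 */
static void filter_out_parked(struct cpumask *candidates)
{
	cpumask_andnot(candidates, candidates, cpu_parked_mask);
}

Which vCPUs to park (lowest- or highest-numbered, NUMA-aware, etc.) is
an arch policy question; the sketch simply parks the highest-numbered
ones for simplicity.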
On 27/05/2025 19:30, Shrikanth Hegde wrote:
>
> Hi Peter, Yury.
>
> Thanks for taking a look at this series.
>
> On 5/27/25 21:17, Yury Norov wrote:
>> On Tue, May 27, 2025 at 05:10:20PM +0200, Peter Zijlstra wrote:
>>> On Fri, May 23, 2025 at 11:44:43PM +0530, Shrikanth Hegde wrote:
>>>> In a para-virtualised environment, there could be multiple
>>>> overcommitted VMs. i.e sum of virtual CPUs(vCPU) > physical CPU(pCPU).
>>>> When all such VMs request for cpu cycles at the same, it is not
>>>> possible to serve all of them. This leads to VM level preemptions and
>>>> hence the steal time.
>>>>
>>>> Bring the notion of CPU parked state which implies underlying pCPU may
>>>> not be available for use at this time. This means it is better to avoid
>>>> this vCPU. So when a CPU is marked as parked, one should vacate it as
>>>> soon as it can. So it is going to dynamic at runtime and can change
>>>> often.
>>>
>>> You've lost me here already. Why would pCPU not be available? Simply
>>> because it is running another vCPU? I would say this means the pCPU is
>>> available, its just doing something else.
>>>
>>> Not available to me means it is going offline or something like that.
>>>
>>>> In general, task level preemption(driven by VM) is less expensive
>>>> than VM level preemption(driven by hypervisor). So pack to less CPUs
>>>> helps to improve the overall workload throughput/latency.
>>>
>>> This seems to suggest you're 'parking' vCPUs, while above you seemed to
>>> suggest pCPU. More confusion.
>
> Yes. I meant parking of vCPUs only. pCPU is running one of those vCPU at
> any point in time.
>
>>>
>>>> cpu parking and need for cpu parking has been explained here as well
>>>> [1]. Much of the context explained in the cover letter there applies
>>>> to this problem context as well.
>>>> [1]: https://lore.kernel.org/all/20250512115325.30022-1-huschle@linux.ibm.com/
>>>
>>> Yeah, totally not following any of that either :/
>>>
>>> Mostly I have only confusion and no idea what you're actually wanting to
>>> do.
>>
>> My wild guess is that the idea is to not preempt the pCPU while running
>> a particular vCPU workload. But I agree, this should all be reworded and
>> explained better. I didn't understand this, either.
>>
>> Thanks,
>> YUry
>
> Sorry, Apologies for not explaining it clearly. My bad.
> Let me take a shot at it again:
>
> ----------------------------
>
> vCPU - Virtual CPUs - CPU in VM world.
> pCPU - Physical CPUs - CPU in baremetal world.
>
> A hypervisor is managing these vCPUs from different VMs. When a vCPU
> requests for CPU, hypervisor does the job of scheduling them on a pCPU.
>
> So this issue occurs when there are more vCPUs(combined across all VMs)
> than the pCPU. So when *all* vCPUs are requesting for CPUs, hypervisor
> can only run a few of them and remaining will be preempted(waiting for
> pCPU).
>
> If we take two VM's, When hypervisor preempts vCPU from VM1 to run vCPU
> from VM2, it has to do save/restore VM context. Instead if VM's can
> co-ordinate among each other and request for *limited* vCPUs, it avoids
> the above overhead and there is context switching within vCPU(less
> expensive). Even if hypervisor is preempting one vCPU to run another
> withing the same VM, it is still more expensive than the task preemption
> within the vCPU. So *basic* aim to avoid vCPU preemption.
>

There is a dilemma for the hypervisor scheduler, as it does not have
many good indicators on when to preempt a vCPU in favor of another one.

Assume we have a hypervisor facing high load; among others, it runs one
VM with 2 vCPUs executing 2 tasks. Naturally, the scheduler in the VM
would place each task on one of the vCPUs. Assume further that, due to
the high load, the hypervisor scheduler cannot schedule both vCPUs at
the same time consistently. This means that the hypervisor scheduler now
decides which of the 2 tasks gets to run. The scheduler in the VM, on
the other hand, has better insight into which of the two tasks should
execute.

If the hypervisor can guarantee the guest that certain vCPUs will be
granted runtime on pCPUs consistently, the VM scheduler has a clear
expectation of the availability of its vCPUs and can make use of that
information.

Essentially, we avoid forcing the hypervisor scheduler to take decisions
which it does not have good information on. We'd rather let the VM
scheduler take those decisions.

This requires, of course, that the hypervisor can give a somewhat
accurate estimate to the VM of how many CPUs it can safely use, which it
should be able to do as it has the information on how much overall load
is on the system, which the VM does not necessarily have. A naive
approach would be to just divide the available pCPUs by the number of
VMs. The more interesting part will be to derive how many vCPUs can be
overconsumed if other VMs are underconsuming.

In the end, both layers, hypervisor and VM, take decisions for which
they have accurate information: the VM scheduler knows about its tasks,
the hypervisor knows about the overall system load.

I played around with the concept quite a bit, especially when there is a
lot of load on the VM itself and it tries to squeeze in short-running
networking operations. In this case the VM scheduler can take better
decisions if it runs fewer vCPUs but stays in full control, instead of
relying on the hypervisor scheduler to schedule all its vCPUs.

>
> So to achieve this, use this parking(we need better name for sure)
> concept, where it is better if workloads avoid some vCPUs at this
> moment. (vCPUs stays online, we don't want the overhead of sched domain
> rebuild).
>
> contention is dynamic in nature. When there is contention for pCPU is to
> be detected and determined by architecture. Archs needs to update the
> mask regularly.
>
> When there is contention, use limited vCPUs as indicated by arch.
> When there is no contention, use all vCPUs.
>

The patches work as expected on s390.
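To put a number on the naive approach mentioned above, a back-of-the-
envelope estimate on the hypervisor side could look like the sketch
below. All names are made up for illustration; real hypervisors use much
richer statistics (entitlements, weights, historic consumption, ...).

/*
 * Purely illustrative: how many vCPUs a guest could use without being
 * preempted, assuming an even split of pCPUs across VMs.
 */
static unsigned int guest_usable_vcpus(unsigned int nr_pcpus,
				       unsigned int nr_vms,
				       unsigned int guest_nr_vcpus)
{
	unsigned int share = nr_vms ? nr_pcpus / nr_vms : nr_pcpus;

	/* Never advertise more than the guest has, and at least one CPU. */
	if (share > guest_nr_vcpus)
		share = guest_nr_vcpus;
	return share ? share : 1;
}

For example, with 16 pCPUs and 4 equally loaded VMs, a guest with 8
vCPUs would be advised to use 4 and park the other 4; deriving how far
it may overconsume while other VMs are idle is the harder part.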
Hi.

> ----------------------------
>
> vCPU - Virtual CPUs - CPU in VM world.
> pCPU - Physical CPUs - CPU in baremetal world.
>
> A hypervisor is managing these vCPUs from different VMs. When a vCPU
> requests for CPU, hypervisor does the job of scheduling them on a pCPU.
>
> So this issue occurs when there are more vCPUs(combined across all VMs)
> than the pCPU. So when *all* vCPUs are requesting for CPUs, hypervisor
> can only run a few of them and remaining will be preempted(waiting for
> pCPU).
>
> If we take two VM's, When hypervisor preempts vCPU from VM1 to run vCPU
> from VM2, it has to do save/restore VM context. Instead if VM's can
> co-ordinate among each other and request for *limited* vCPUs, it avoids
> the above overhead and there is context switching within vCPU(less
> expensive). Even if hypervisor is preempting one vCPU to run another
> withing the same VM, it is still more expensive than the task preemption
> within the vCPU. So *basic* aim to avoid vCPU preemption.
>
> So to achieve this, use this parking(we need better name for sure)
> concept, where it is better if workloads avoid some vCPUs at this
> moment. (vCPUs stays online, we don't want the overhead of sched domain
> rebuild).
>
> contention is dynamic in nature. When there is contention for pCPU is to
> be detected and determined by architecture. Archs needs to update the
> mask regularly.
>
> When there is contention, use limited vCPUs as indicated by arch.
> When there is no contention, use all vCPUs.
>

I hope this helped to set the problem context. I am trying to get
feedback on whether the approach makes sense. I will go through the
other push mechanisms we have (for example in rt/dl).