[RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption

Shrikanth Hegde posted 10 patches 4 months, 4 weeks ago
tl;dr

This is a follow-up of [1] with a few fixes, addressing review comments.
Upgraded it to RFC PATCH from RFC.
Please review.

[1]: v2 - https://lore.kernel.org/all/20250625191108.1646208-1-sshegde@linux.ibm.com/

v2 -> v3:
- Renamed to paravirt CPUs.
- Folded the changes under CONFIG_PARAVIRT.
- Fixed the crash due to work_buf corruption while using
  stop_one_cpu_nowait.
- Added sysfs documentation.
- Copied most of __balance_push_cpu_stop into a new function; this helps
  move the code out of CONFIG_HOTPLUG_CPU.
- Did some of the suggested code movement.

-----------------
::Detailed info:: 
-----------------
Problem statement 

vCPU - Virtual CPU - a CPU in the VM world.
pCPU - Physical CPU - a CPU in the baremetal world.

A hypervisor schedules vCPUs on pCPUs. It has to give each vCPU some
cycles and be fair. When there are more runnable vCPUs than pCPUs, the
hypervisor has to preempt some vCPUs in order to run others.
This is called vCPU preemption.

Take two VMs: when the hypervisor preempts a vCPU from VM1 to run a vCPU
from VM2, it has to save/restore VM context. If instead the VMs can
co-ordinate among themselves and each request a limited number of vCPUs,
this overhead is avoided and context switching happens within a vCPU
(which is less expensive). Even when the hypervisor preempts one vCPU to
run another within the same VM, it is still more expensive than task
preemption within the vCPU. So the basic aim is to avoid vCPU preemption.

To achieve this, introduce the "paravirt CPU" concept: vCPUs that the
workload is better off avoiding at the moment. (The vCPUs stay online;
we don't want the overhead of a sched domain rebuild, and hotplug takes
a lot of time too.)

When there is contention, don't use paravirt CPUs.
When there is no contention, use all vCPUs. 

----------------------------------
Implementation details and choices

- The current version copies most of the code in __balance_push_cpu_stop.
  This was done to avoid the CONFIG_HOTPLUG_CPU dependency and move the
  code under CONFIG_PARAVIRT. It also allows fixing the race in
  stop_one_cpu_nowait; hacks in __balance_push_cpu_stop would be needed
  otherwise.

- Explored using task_work_add instead of stop_one_cpu_nowait, similar
  to what mm_cid does. It sometimes ended up locking up the system, and
  it takes slightly longer to move tasks compared to
  stop_one_cpu_nowait.

- Tried using push_cpu_stop instead of adding more code. Made it work
  for CFS by adding find_lock_rq, but RT tasks fail to move out of
  paravirt CPUs completely; around 5-10% utilization is left behind.
  Maybe it races with RT push/pull since they all use push_busy for
  gating.

- Kept the helper patch where one can specify a cpulist to set the
  paravirt CPUs. It helped uncover corner cases, such as marking CPUs
  0-100 as paravirt, which the number-based debug file couldn't express.
  The nature of the hint could change, so both flavours are kept for
  now; this will change depending on how the hint design goes.

---------------------
bloat-o-meter reports

- CONFIG_PARAVIRT=y
add/remove: 12/0 grow/shrink: 14/0 up/down: 1767/0 (1767)
Function                                     old     new   delta
paravirt_push_cpu_stop                         -     479    +479
push_current_from_paravirt_cpu                 -     410    +410
store_paravirt_cpus                            -     174    +174
...
Total: Before=25132435, After=25134202, chg +0.01%
Values depend on NR_CPUS. Above data is for NR_CPUS=64 on x86.

add/remove: 18/3 grow/shrink: 26/12 up/down: 5320/-484 (4836)
Function                                     old     new   delta
__cpu_paravirt_mask                            -    1024   +1024
paravirt_push_cpu_stop                         -     864    +864
push_current_from_paravirt_cpu                 -     648    +648
...
Total: Before=30273517, After=30278353, chg +0.02%
on PowerPC with NR_CPUS=8192.


- CONFIG_PARAVIRT=n
add/remove: 0/0 grow/shrink: 2/1 up/down: 35/-32 (3)
Function                                     old     new   delta
select_task_rq_fair                         4376    4395     +19
check_preempt_wakeup_fair                    895     911     +16
set_next_entity                              659     627     -32
Total: Before=25106525, After=25106528, chg +0.00%

------------------------------
Functional and Performance data

- Tasks move out of paravirt CPUs quite fast. Even when the system is
  heavily loaded, it takes at most 1-2 seconds for tasks to move out of
  all paravirt CPUs.

- schbench results. Experiments were done on a system with 94 physical
  cores, running two Shared Processor LPARs (VMs). LPAR1 has 90 cores
  (60 entitled) and LPAR2 has 64 cores (32 entitled); entitled means the
  LPAR should get at least that many cores' worth of cycles. When both
  LPARs run at high utilization at the same time, there is contention
  and high steal time is seen. When there was contention, the
  non-entitled cores were marked as paravirt CPUs; in another experiment
  the non-entitled CPUs were hotplugged instead. Both sets of data below
  show the advantage of using paravirt CPUs.
  LPAR1 runs schbench while LPAR2 runs stress-ng intermittently,
  i.e. busy/idle (stress-ng runs for 60 sec, then idles for 60 sec).

Wakeup Latencies    Out of Box    cpu_hotplug    cpu_paravirt
50.0th:                     15             15              14
90.0th:                     70             25              19
99.0th:                   3084            345              95
99.9th:                   6184           3004             523

  When the busy/idle duration in LPAR2 is reduced to around 10 seconds,
  the benefit of cpu_paravirt shrinks. cpu_hotplug won't work at all in
  those cases, since the hotplug operation itself takes 20+ seconds.
  Compared to out of box, the benefit of cpu_paravirt shows up when the
  busy/idle duration is greater than 10 seconds; when the concurrency of
  the system is lowered, benefit is seen even at 10 seconds. So using
  paravirt CPUs will likely help workloads which are sensitive to
  latency.
------------
Open issues: 

- Deriving the hint from steal time is still a challenge. Some work is
  underway to address it.

- Consider KVM and other hypervisors and how they could derive the
  hint. Need inputs from the community.

- Make irqbalance understand cpu_paravirt_mask.

- Works somewhat on nohz_full CPUs, but tasks don't completely move out
  of a few CPUs. Was wondering if it would work there at all, since the
  tick is usually disabled. Need to understand/investigate further.

Shrikanth Hegde (10):
  sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
  cpumask: Introduce cpu_paravirt_mask
  sched: Static key to check paravirt cpu push
  sched/core: Dont allow to use CPU marked as paravirt
  sched/fair: Don't consider paravirt CPUs for wakeup and load balance
  sched/rt: Don't select paravirt CPU for wakeup and push/pull rt task
  sched/core: Push current task from paravirt CPU
  sysfs: Add cpu paravirt file
  powerpc: Add debug file for set/unset paravirt CPUs
  sysfs: Provide write method for paravirt

 .../ABI/testing/sysfs-devices-system-cpu      |   9 ++
 Documentation/scheduler/sched-arch.rst        |  37 +++++++
 arch/powerpc/include/asm/paravirt.h           |   1 +
 arch/powerpc/kernel/smp.c                     |  58 ++++++++++
 drivers/base/base.h                           |   4 +
 drivers/base/cpu.c                            |  53 +++++++++
 include/linux/cpumask.h                       |  15 +++
 kernel/sched/core.c                           | 103 +++++++++++++++++-
 kernel/sched/fair.c                           |  15 ++-
 kernel/sched/rt.c                             |  11 +-
 kernel/sched/sched.h                          |  26 ++++-
 11 files changed, 325 insertions(+), 7 deletions(-)

-- 
2.47.3
Re: [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption
Posted by Sean Christopherson 3 months, 2 weeks ago
On Wed, Sep 10, 2025, Shrikanth Hegde wrote:
> tl;dr
> 
> This is follow up of [1] with few fixes and addressing review comments.
> Upgraded it to RFC PATCH from RFC. 
> Please review. 
> 
> [1]: v2 - https://lore.kernel.org/all/20250625191108.1646208-1-sshegde@linux.ibm.com/
> 
> v2 -> v3:
> - Renamed to paravirt CPUs

There are myriad uses of "paravirt" throughout Linux and related environments,
and none of them mean "oversubscribed" or "contended".  I assume Hillf's comments
triggered the rename from "avoid CPUs", but IMO "avoid" is at least somewhat
accurate; "paravirt" is wildly misleading.

> - Folded the changes under CONFIG_PARAVIRT.
> - Fixed the crash due work_buf corruption while using
>   stop_one_cpu_nowait. 
> - Added sysfs documentation.
> - Copy most of __balance_push_cpu_stop to new one, this helps it move 
>   the code out of CONFIG_HOTPLUG_CPU. 
> - Some of the code movement suggested. 
> 
> -----------------
> ::Detailed info:: 
> -----------------
> Problem statement 
> 
> vCPU - Virtual CPUs - CPU in VM world.
> pCPU - Physical CPUs - CPU in baremetal world.
> 
> A hypervisor does scheduling of vCPUs on a pCPUs. It has to give each
> vCPU some cycles and be fair. When there are more vCPU requests than
> the pCPUs, hypervsior has to preempt some vCPUs in order to run others.
> This is called as vCPU preemption.
> 
> If we take two VM's, When hypervisor preempts vCPU from VM1 to run vCPU from 
> VM2, it has to do save/restore VM context.Instead if VM's can co-ordinate among
> each other and request for limited  vCPUs, it avoids the above overhead and 
> there is context switching within vCPU(less expensive). Even if hypervisor
> is preempting one vCPU to run another within the same VM, it is still more 
> expensive than the task preemption within the vCPU. So basic aim to avoid 
> vCPU preemption.
> 
> So to achieve this, introduce "Paravirt CPU" concept, where it is better if
> workload avoids these vCPUs at this moment. (vCPUs stays online, don't want
> the overhead of sched domain rebuild and hotplug takes a lot of time too).
> 
> When there is contention, don't use paravirt CPUs.
> When there is no contention, use all vCPUs. 

...

> ------------
> Open issues: 
> 
> - Derivation of hint from steal time is still a challenge. Some work is
>   underway to address it. 
> 
> - Consider kvm and other hypervsiors and how they could derive the hint.
>   Need inputs from community. 

Bluntly, this series is never going to land, at least not in a form that's remotely
close to what is proposed here.  This is an incredibly simplistic way of handling
overcommit, and AFAICT there's no line of sight to supporting more complex scenarios.

I.e. I don't see a path to resolving all these "todos" in the changelog from the
last patch:

 : Ideal would be get the hint from hypervisor. It would be more accurate
 : since it has knowledge of all SPLPARs deployed in the system.
 : 
 : Till the hint from underlying hypervisor arrives, another idea is to
 : approximate the hint from steal time. There are some works ongoing, but
 : not there yet due to challenges revolving around limits and
 : convergence.
 : 
 : Till that happens, there is a need for debugfs file which could be used to
 : set/unset the hint. The interface currently is number starting from which
 : CPUs will marked as paravirt. It could be changed to one the takes a
 : cpumask(list of CPUs) in future.

I see Vineeth and Steven are on the Cc.  Argh, and you even commented on their
first RFC[1], where it was made quite clear that sprinkling one-off "hints"
throughout the kernel wasn't a viable approach.

I don't know the current status of the ChromeOS work, but there was agreement in
principle that the bulk of paravirt scheduling should not need to touch the kernel
(host or guest)[2].

[1] https://lore.kernel.org/all/20231214024727.3503870-1-vineeth@bitbyteword.org
[2] https://lore.kernel.org/all/ZjJf27yn-vkdB32X@google.com
Re: [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption
Posted by Shrikanth Hegde 3 months, 2 weeks ago
Hi Sean.
Thanks for taking time and going through the series.

On 10/20/25 8:02 PM, Sean Christopherson wrote:
> On Wed, Sep 10, 2025, Shrikanth Hegde wrote:
>> tl;dr
>>
>> This is follow up of [1] with few fixes and addressing review comments.
>> Upgraded it to RFC PATCH from RFC.
>> Please review.
>>
>> [1]: v2 - https://lore.kernel.org/all/20250625191108.1646208-1-sshegde@linux.ibm.com/
>>
>> v2 -> v3:
>> - Renamed to paravirt CPUs
> 
> There are myriad uses of "paravirt" throughout Linux and related environments,
> and none of them mean "oversubscribed" or "contended".  I assume Hillf's comments
> triggered the rename from "avoid CPUs", but IMO "avoid" is at least somewhat
> accurate; "paravirt" is wildly misleading.

Naming has been tricky. We want a positive-sounding name while conveying
that these CPUs are not to be used for now due to contention, and may be
used again once the contention is gone.


> 
>> - Folded the changes under CONFIG_PARAVIRT.
>> - Fixed the crash due work_buf corruption while using
>>    stop_one_cpu_nowait.
>> - Added sysfs documentation.
>> - Copy most of __balance_push_cpu_stop to new one, this helps it move
>>    the code out of CONFIG_HOTPLUG_CPU.
>> - Some of the code movement suggested.
>>
>> -----------------
>> ::Detailed info::
>> -----------------
>> Problem statement
>>
>> vCPU - Virtual CPUs - CPU in VM world.
>> pCPU - Physical CPUs - CPU in baremetal world.
>>
>> A hypervisor does scheduling of vCPUs on a pCPUs. It has to give each
>> vCPU some cycles and be fair. When there are more vCPU requests than
>> the pCPUs, hypervsior has to preempt some vCPUs in order to run others.
>> This is called as vCPU preemption.
>>
>> If we take two VM's, When hypervisor preempts vCPU from VM1 to run vCPU from
>> VM2, it has to do save/restore VM context.Instead if VM's can co-ordinate among
>> each other and request for limited  vCPUs, it avoids the above overhead and
>> there is context switching within vCPU(less expensive). Even if hypervisor
>> is preempting one vCPU to run another within the same VM, it is still more
>> expensive than the task preemption within the vCPU. So basic aim to avoid
>> vCPU preemption.
>>
>> So to achieve this, introduce "Paravirt CPU" concept, where it is better if
>> workload avoids these vCPUs at this moment. (vCPUs stays online, don't want
>> the overhead of sched domain rebuild and hotplug takes a lot of time too).
>>
>> When there is contention, don't use paravirt CPUs.
>> When there is no contention, use all vCPUs.
> 
> ...
> 
>> ------------
>> Open issues:
>>
>> - Derivation of hint from steal time is still a challenge. Some work is
>>    underway to address it.
>>
>> - Consider kvm and other hypervsiors and how they could derive the hint.
>>    Need inputs from community.
> 
> Bluntly, this series is never going to land, at least not in a form that's remotely
> close to what is proposed here.  This is an incredibly simplistic way of handling
> overcommit, and AFAICT there's no line of sight to supporting more complex scenarios.
> 

Could you describe these complex scenarios?

The current use cases are on two archs, powerpc and s390. IIUC, both
have a non-Linux hypervisor running on the host, with Linux guests.

The s390 hypervisor has a way of marking a vCPU as Vertical High,
Vertical Medium or Vertical Low. So when there is steal time, the arch
could easily mark the Vertical Lows as "paravirt" CPUs.

> I.e. I don't see a path to resolving all these "todos" in the changelog from the
> last patch:
> 
>   : Ideal would be get the hint from hypervisor. It would be more accurate
>   : since it has knowledge of all SPLPARs deployed in the system.
>   :
>   : Till the hint from underlying hypervisor arrives, another idea is to
>   : approximate the hint from steal time. There are some works ongoing, but
>   : not there yet due to challenges revolving around limits and
>   : convergence.
>   :
>   : Till that happens, there is a need for debugfs file which could be used to
>   : set/unset the hint. The interface currently is number starting from which
>   : CPUs will marked as paravirt. It could be changed to one the takes a
>   : cpumask(list of CPUs) in future.
> 
> I see Vineeth and Steven are on the Cc.  Argh, and you even commented on their
> first RFC[1], where it was made quite clear that sprinkling one-off "hints"
> throughoug the kernel wasn't a viable approach.

IIRC, it was in the other direction: the guest was asking the host to
boost some vCPUs by running them as RT tasks on the host.

> 
> I don't know the current status of the ChromeOS work, but there was agreement in
> principle that the bulk of paravirt scheduling should not need to touch the kernel
> (host or guest)[2].
> 

If, based on some event, all the tasks on a CPU have to move out, then
the scheduler needs to be involved, no? To move the tasks out, and to
not schedule anything new on that CPU.

The current mechanisms, such as CPU hotplug and isolated partitions, all
break task affinity. So a new mechanism is needed.

Note: the host is not running a Linux kernel. We request the host to
provide this info through an HCALL or the VPA area.

> [1] https://lore.kernel.org/all/20231214024727.3503870-1-vineeth@bitbyteword.org
> [2] https://lore.kernel.org/all/ZjJf27yn-vkdB32X@google.com

Vineeth,
what's the latest on the vcpu_boosted framework? AFAIR both guest and
host were running Linux there.
Re: [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption
Posted by Sean Christopherson 3 months, 2 weeks ago
On Tue, Oct 21, 2025, Shrikanth Hegde wrote:
> 
> Hi Sean.
> Thanks for taking time and going through the series.
> 
> On 10/20/25 8:02 PM, Sean Christopherson wrote:
> > On Wed, Sep 10, 2025, Shrikanth Hegde wrote:
> > > tl;dr
> > > 
> > > This is follow up of [1] with few fixes and addressing review comments.
> > > Upgraded it to RFC PATCH from RFC.
> > > Please review.
> > > 
> > > [1]: v2 - https://lore.kernel.org/all/20250625191108.1646208-1-sshegde@linux.ibm.com/
> > > 
> > > v2 -> v3:
> > > - Renamed to paravirt CPUs
> > 
> > There are myriad uses of "paravirt" throughout Linux and related environments,
> > and none of them mean "oversubscribed" or "contended".  I assume Hillf's comments
> > triggered the rename from "avoid CPUs", but IMO "avoid" is at least somewhat
> > accurate; "paravirt" is wildly misleading.
> 
> Name has been tricky. We want to have a positive sounding name while
> conveying that these CPUs are not be used for now due to contention,
> they may be used again when the contention has gone.

I suspect part of the problem with naming is the all-or-nothing approach itself.
There's a _lot_ of policy baked into that seemingly simple decision, and thus
it's hard to describe with a human-friendly name.

> > > Open issues:
> > > 
> > > - Derivation of hint from steal time is still a challenge. Some work is
> > >    underway to address it.
> > > 
> > > - Consider kvm and other hypervsiors and how they could derive the hint.
> > >    Need inputs from community.
> > 
> > Bluntly, this series is never going to land, at least not in a form that's remotely
> > close to what is proposed here.  This is an incredibly simplistic way of handling
> > overcommit, and AFAICT there's no line of sight to supporting more complex scenarios.
> > 
> 
> Could you describe these complex scenarios?

Any setup where "don't use this CPU" isn't a viable option, e.g. because all cores
could be overcommitted at any given time, or is far, far too coarse-grained.  Very
few use cases can distill vCPU scheduling needs and policies into single flag.

E.g. if all CPUs in a system are being used to run vCPU tasks, all vCPUs are
actively running, and the host has a non-vCPU task that _must_ run, then the
host will need to preempt a vCPU task.  Ideally, a paravirtualized scheduling
system would allow the host to make an informed decision when choosing which
vCPU to preempt, e.g. to minimize disruption to the guest(s).
Re: [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption
Posted by Shrikanth Hegde 3 months, 1 week ago
Hi Sean.

On 10/23/25 12:16 AM, Sean Christopherson wrote:
> On Tue, Oct 21, 2025, Shrikanth Hegde wrote:
>>
>> Hi Sean.
>> Thanks for taking time and going through the series.
>>
>> On 10/20/25 8:02 PM, Sean Christopherson wrote:
>>> On Wed, Sep 10, 2025, Shrikanth Hegde wrote:
>>>> tl;dr
>>>>
>>>> This is follow up of [1] with few fixes and addressing review comments.
>>>> Upgraded it to RFC PATCH from RFC.
>>>> Please review.
>>>>
>>>> [1]: v2 - https://lore.kernel.org/all/20250625191108.1646208-1-sshegde@linux.ibm.com/
>>>>
>>>> v2 -> v3:
>>>> - Renamed to paravirt CPUs
>>>
>>> There are myriad uses of "paravirt" throughout Linux and related environments,
>>> and none of them mean "oversubscribed" or "contended".  I assume Hillf's comments
>>> triggered the rename from "avoid CPUs", but IMO "avoid" is at least somewhat
>>> accurate; "paravirt" is wildly misleading.
>>
>> Name has been tricky. We want to have a positive sounding name while
>> conveying that these CPUs are not be used for now due to contention,
>> they may be used again when the contention has gone.
> 
> I suspect part of the problem with naming is the all-or-nothing approach itself.
> There's a _lot_ of policy baked into that seemingly simple decision, and thus
> it's hard to describe with a human-friendly name.
> 

open for suggestions :)

>>>> Open issues:
>>>>
>>>> - Derivation of hint from steal time is still a challenge. Some work is
>>>>     underway to address it.
>>>>
>>>> - Consider kvm and other hypervsiors and how they could derive the hint.
>>>>     Need inputs from community.
>>>
>>> Bluntly, this series is never going to land, at least not in a form that's remotely
>>> close to what is proposed here.  This is an incredibly simplistic way of handling
>>> overcommit, and AFAICT there's no line of sight to supporting more complex scenarios.
>>>
>>
>> Could you describe these complex scenarios?
> 
> Any setup where "don't use this CPU" isn't a viable option, e.g. because all cores
> could be overcommitted at any given time, or is far, far too coarse-grained.  Very
> few use cases can distill vCPU scheduling needs and policies into single flag.
> 

Okay, let me explain the current thought process.

S390 and pseries are the current main use cases.

On s390, the Z hypervisor provides a distinction among vCPUs: they are marked
as Vertical High, Vertical Medium or Vertical Low. When there is steal time,
it is recommended to use the Vertical Highs and avoid the Vertical Lows. In
such cases, using this infra, one can avoid scheduling anything on the
Vertical Low vCPUs. A performance benefit is observed since there is less
contention and CPU cycles come mainly from Vertical Highs.

The PowerVM hypervisor dispatches a full core at a time: all SMT=8 siblings are
always dispatched to the same core. That means it is beneficial to schedule on
vCPU siblings together at the core level. When there is contention for pCPUs, the
full core is preempted, i.e. all vCPUs belonging to that core are preempted.
In such cases, depending on the overcommit configuration and the steal time,
one could limit core usage by running on a limited set of vCPUs. Done that way,
we see better latency numbers and increased throughput compared to out of box.
The cover letter has those numbers.

Now, let's come to KVM with Linux running as the hypervisor. Correct me if I am
wrong: each vCPU in KVM is a process on the host, and when the vCPU is running,
that process is in the running state as well. When there is overcommit and all
vCPUs are running, there are more runnable processes than physical CPUs, so the
host has to context switch and will preempt one vCPU to run another. It can also
preempt a vCPU to run some host process.
If we restrict the number of vCPUs where the workload is currently running, the
number of runnable processes on the host also drops, and there is less chance of
host context switches. Since this avoids the overhead of KVM context
save/restore, the workload is likely to benefit.

I guess it is possible to distinguish between a host process and a vCPU running
as a process. If so, the host can decide how many threads it can optimally run
and signal each guest depending on the configuration.

For now this is kept arch dependent, since IMHO each hypervisor is in the right
place to make that decision. Not sure a one-size-fits-all approach works here.

Another tricky point is what form the signal should take. It could be an hcall,
the VPA area, some shared memory region, or a BPF method similar to the vCPU
boosting patch series. There too, I think it is best to let the arch specify
how, the reason being that the BPF method will not work for the PowerVM
hypervisor.

> E.g. if all CPUs in a system are being used to vCPU tasks, all vCPUs are actively
> running, and the host has a non-vCPU task that _must_ run, then the host will need
> to preempt a vCPU task.  Ideally, a paravirtualized scheduling system would allow

The host/hypervisor need not mark a vCPU as "do not use" every single time it
preempts it. It needs to do so only when there are more vCPU processes than
physical CPUs and preemption is happening between vCPU processes.

There will be corner cases, such as only one physical CPU and two KVM guests
each with one vCPU; nothing much can be done then.

> the host to make an informed decision when choosing which vCPU to preempt, e.g. to
> minimize disruption to the guest(s).
Re: [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption
Posted by Paolo Bonzini 3 months, 2 weeks ago
On 10/20/25 16:32, Sean Christopherson wrote:
>   : Till the hint from underlying hypervisor arrives, another idea is to
>   : approximate the hint from steal time.

I think this is the first thing to look at.

Perhaps single_task_running() can be exposed in the x86 steal time data 
structure, and in fact even in the rseq data for non-VM usecases?  This 
is not specific to VMs and I'd like the steal time implementation to 
follow the footsteps of rseq rather than the opposite.

Paolo
Re: [RFC PATCH v3 00/10] paravirt CPUs and push task for less vCPU preemption
Posted by Shrikanth Hegde 3 months, 2 weeks ago
Hi Paolo. Thanks for looking into this series.

On 10/20/25 8:35 PM, Paolo Bonzini wrote:
> On 10/20/25 16:32, Sean Christopherson wrote:
>>   : Till the hint from underlying hypervisor arrives, another idea is to
>>   : approximate the hint from steal time.
> 
> I think this is the first thing to look at.
>

The current code I have does the below. All of this happens in the guest;
no change on the host (the host is running PowerVM, a non-Linux hypervisor).

Every 1 second (configurable):
1. Low and high steal time thresholds are defined (configurable).
2. Gather steal time from all CPUs.
3. If it is higher than the high threshold, reduce core (SMT8) usage by 1.
4. If it is lower than the low threshold, increase core usage by 1.
5. Avoid ping-pong as much as possible.

It's initial code to try out whether it works when plumbed into the
push-current-task framework given in the series.

  
> Perhaps single_task_running() can be exposed in the x86 steal time data 
> structure, and in fact even in the rseq data for non-VM usecases?  This 
> is not specific to VMs and I'd like the steal time implementation to 
> follow the footsteps of rseq rather than the opposite.
> 
> Paolo
> 

Sorry, I didn't follow. You mean KVM use cases?

I don't know much about rseq (it's on my todo list). Is there any specific
implementation done via rseq that you are talking about, which I could look at?