[PATCH 01/17] sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept

Posted by Shrikanth Hegde 1 week, 5 days ago
Add documentation for the new cpumask called cpu_paravirt_mask. This could
help users understand what this mask is and the concept behind it.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 Documentation/scheduler/sched-arch.rst | 37 ++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
index ed07efea7d02..6972c295013d 100644
--- a/Documentation/scheduler/sched-arch.rst
+++ b/Documentation/scheduler/sched-arch.rst
@@ -62,6 +62,43 @@ Your cpu_idle routines need to obey the following rules:
 arch/x86/kernel/process.c has examples of both polling and
 sleeping idle functions.
 
+Paravirt CPUs
+=============
+
+In virtualised environments it is possible to overcommit CPU resources,
+i.e. the sum of virtual CPUs (vCPUs) across all VMs is greater than the
+number of physical CPUs (pCPUs). Under such conditions, when all or many
+VMs have high utilization, the hypervisor cannot satisfy the CPU demand
+and has to context switch within or across VMs, i.e. the hypervisor needs
+to preempt one vCPU to run another. This is called vCPU preemption and is
+more expensive than a task context switch within a vCPU.
+
+In such cases it is better that VMs co-ordinate among themselves and ask
+for less CPU by not using some of their vCPUs. Such vCPUs, on which the
+workload can be avoided for the moment to reduce vCPU preemption, are
+called "Paravirt CPUs". Note that when the pCPU contention goes away,
+these vCPUs can be used again by the workload.
+
+The architecture needs to set/clear the specific vCPU in cpu_paravirt_mask.
+When set, the scheduler avoids that vCPU; when clear, it is used as usual.
+
+The scheduler tries to avoid paravirt CPUs as much as it can.
+This is achieved by:
+1. Not selecting a paravirt CPU at wakeup.
+2. Pushing tasks away from a paravirt CPU at the tick.
+3. Not selecting a paravirt CPU at load balance.
+
+This works only for SCHED_RT and SCHED_NORMAL. SCHED_EXT and userspace can
+make their own choices based on cpu_paravirt_mask.
+
+/sys/devices/system/cpu/paravirt prints the current cpu_paravirt_mask in
+cpulist format.
+
+Notes:
+1. A task pinned only to paravirt CPUs will continue to run there.
+2. This feature is available under CONFIG_PARAVIRT.
+3. Refer to PowerPC for the architecture side of the implementation.
+4. Tasks running on isolated CPUs are not pushed out.
 
 Possible arch/ problems
 =======================
-- 
2.47.3
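
A minimal sketch of what the arch-side set/clear described in the hunk above could
look like, assuming cpu_paravirt_mask is a regular writable struct cpumask that the
architecture may update with the generic cpumask helpers; the real series may
provide a dedicated accessor instead, and arch_mark_cpu_paravirt() is a made-up
name used only for illustration:

#include <linux/cpumask.h>

/* cpu_paravirt_mask is assumed here to be a writable struct cpumask
 * declared elsewhere by the series; the exact type and accessors may
 * differ in the actual implementation.
 */
extern struct cpumask *cpu_paravirt_mask;

/* Hypothetical arch-side hook, called when the hypervisor reports that
 * pCPU contention for this vCPU has started or ended.
 */
static void arch_mark_cpu_paravirt(unsigned int cpu, bool contended)
{
	if (contended)
		cpumask_set_cpu(cpu, cpu_paravirt_mask);   /* scheduler starts avoiding it */
	else
		cpumask_clear_cpu(cpu, cpu_paravirt_mask); /* CPU is usable again */
}

On the userspace side, which the documentation says can make its own choices using
cpu_paravirt_mask, a small standalone sketch that reads the documented
/sys/devices/system/cpu/paravirt cpulist and drops those CPUs from the calling
task's affinity; error handling is minimal and the file is assumed to exist:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	cpu_set_t set;
	char buf[4096];
	char *tok;
	FILE *f;

	/* Start from the CPUs this task is currently allowed to run on. */
	if (sched_getaffinity(0, sizeof(set), &set))
		return 1;

	f = fopen("/sys/devices/system/cpu/paravirt", "r");
	if (!f)
		return 1;
	if (!fgets(buf, sizeof(buf), f))
		buf[0] = '\0';	/* empty mask: nothing to avoid */
	fclose(f);

	/* Parse cpulist entries such as "2" or "4-7" and clear them. */
	for (tok = strtok(buf, ",\n"); tok; tok = strtok(NULL, ",\n")) {
		int first, last, n;

		n = sscanf(tok, "%d-%d", &first, &last);
		if (n < 1)
			continue;
		if (n < 2)
			last = first;
		while (first <= last)
			CPU_CLR(first++, &set);
	}

	/* Keep the old affinity if every allowed CPU turned out paravirt. */
	if (CPU_COUNT(&set))
		sched_setaffinity(0, sizeof(set), &set);

	return 0;
}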
Re: [PATCH 01/17] sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
Posted by Hillf Danton 1 week, 5 days ago
On Wed, 19 Nov 2025 18:14:33 +0530 Shrikanth Hegde wrote:
> Add documentation for new cpumask called cpu_paravirt_mask. This could
> help users in understanding what this mask and the concept behind it.
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
>  Documentation/scheduler/sched-arch.rst | 37 ++++++++++++++++++++++++++
>  1 file changed, 37 insertions(+)
> 
> diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
> index ed07efea7d02..6972c295013d 100644
> --- a/Documentation/scheduler/sched-arch.rst
> +++ b/Documentation/scheduler/sched-arch.rst
> @@ -62,6 +62,43 @@ Your cpu_idle routines need to obey the following rules:
>  arch/x86/kernel/process.c has examples of both polling and
>  sleeping idle functions.
>  
> +Paravirt CPUs
> +=============
> +
> +Under virtualised environments it is possible to overcommit CPU resources.
> +i.e sum of virtual CPU(vCPU) of all VM's is greater than number of physical
> +CPUs(pCPU). Under such conditions when all or many VM's have high utilization,
> +hypervisor won't be able to satisfy the CPU requirement and has to context
> +switch within or across VM. i.e hypervisor need to preempt one vCPU to run
> +another. This is called vCPU preemption. This is more expensive compared to
> +task context switch within a vCPU.
> +
What is missing is
1) vCPU preemption is X% more expensive compared to task context switch within a vCPU.

> +In such cases it is better that VM's co-ordinate among themselves and ask for
> +less CPU by not using some of the vCPUs. Such vCPUs where workload can be
> +avoided at the moment for less vCPU preemption are called as "Paravirt CPUs".
> +Note that when the pCPU contention goes away, these vCPUs can be used again
> +by the workload.
> +
2) given X, how to work out Y, the number of Paravirt CPUs for the simple
scenario like 8 pCPUs and 16 vCPUs (8 vCPUs from VM1, 8 vCPUs from VM2)?

> +Arch need to set/unset the specific vCPU in cpu_paravirt_mask. When set, avoid
> +that vCPU and when unset, use it as usual.
> +
> +Scheduler will try to avoid paravirt vCPUs as much as it can.
> +This is achieved by
> +1. Not selecting paravirt CPU at wakeup.
> +2. Push the task away from paravirt CPU at tick.
> +3. Not selecting paravirt CPU at load balance.
> +
> +This works only for SCHED_RT and SCHED_NORMAL. SCHED_EXT and userspace can make
> +choices accordingly using cpu_paravirt_mask.
> +
> +/sys/devices/system/cpu/paravirt prints the current cpu_paravirt_mask in
> +cpulist format.
> +
> +Notes:
> +1. A task pinned only on paravirt CPUs will continue to run there.
> +2. This feature is available under CONFIG_PARAVIRT
> +3. Refer to PowerPC for architecure implementation side.
> +4. Doesn't push out any task running on isolated CPUs.
>  
>  Possible arch/ problems
>  =======================
> -- 
> 2.47.3
Re: [PATCH 01/17] sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
Posted by Shrikanth Hegde 1 week, 4 days ago

On 11/20/25 3:18 AM, Hillf Danton wrote:
> On Wed, 19 Nov 2025 18:14:33 +0530 Shrikanth Hegde wrote:
>> Add documentation for new cpumask called cpu_paravirt_mask. This could
>> help users in understanding what this mask and the concept behind it.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>>   Documentation/scheduler/sched-arch.rst | 37 ++++++++++++++++++++++++++
>>   1 file changed, 37 insertions(+)
>>
>> diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
>> index ed07efea7d02..6972c295013d 100644
>> --- a/Documentation/scheduler/sched-arch.rst
>> +++ b/Documentation/scheduler/sched-arch.rst
>> @@ -62,6 +62,43 @@ Your cpu_idle routines need to obey the following rules:
>>   arch/x86/kernel/process.c has examples of both polling and
>>   sleeping idle functions.
>>   
>> +Paravirt CPUs
>> +=============
>> +
>> +Under virtualised environments it is possible to overcommit CPU resources.
>> +i.e sum of virtual CPU(vCPU) of all VM's is greater than number of physical
>> +CPUs(pCPU). Under such conditions when all or many VM's have high utilization,
>> +hypervisor won't be able to satisfy the CPU requirement and has to context
>> +switch within or across VM. i.e hypervisor need to preempt one vCPU to run
>> +another. This is called vCPU preemption. This is more expensive compared to
>> +task context switch within a vCPU.
>> +
> What is missing is
> 1) vCPU preemption is X% more expensive compared to task context switch within a vCPU.
> 

This would change from arch to arch IMO. Will try to get numbers from PowerVM hypervisor.

>> +In such cases it is better that VM's co-ordinate among themselves and ask for
>> +less CPU by not using some of the vCPUs. Such vCPUs where workload can be
>> +avoided at the moment for less vCPU preemption are called as "Paravirt CPUs".
>> +Note that when the pCPU contention goes away, these vCPUs can be used again
>> +by the workload.
>> +
> 2) given X, how to work out Y, the number of Paravirt CPUs for the simple
> scenario like 8 pCPUs and 16 vCPUs (8 vCPUs from VM1, 8 vCPUs from VM2)?
> 

Y need not be dependent on X. Note that CPUs are marked as paravirt only when both
VMs end up consuming all of the CPU resources.

Different cases:
1. VM1 is idle and VM2 is idle - No vCPUs are marked as paravirt.
2. VM1 is 100% busy and VM2 is idle - No steal time is seen - No vCPUs are marked as paravirt.
3. VM1 is idle and VM2 is 100% busy - No steal time is seen - No vCPUs are marked as paravirt.
4. VM1 is 100% busy and VM2 is 100% busy - 50% steal time would be seen in each -
	Since there are only 8 pCPUs (assuming each VM is allocated equally), 4 vCPUs in
	each VM will be marked as paravirt. The workload consolidates onto the remaining
	4 vCPUs and hence no steal time will be seen. The benefit comes from the host no
	longer needing to do expensive VM context switches (see the rough sketch below).
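
For the 8 pCPU / 16 vCPU case above, a rough back-of-the-envelope sketch of that
arithmetic; this only illustrates the example, it is not the heuristic the series
actually implements, and paravirt_candidates() is a made-up name:

/* Back-of-the-envelope version of case 4 above, not the in-kernel logic.
 * With equal entitlement each busy VM can expect pcpus / nr_busy_vms
 * physical CPUs, so the vCPUs beyond that share are the candidates for
 * being marked paravirt.
 */
static int paravirt_candidates(int vcpus_in_vm, int pcpus, int nr_busy_vms)
{
	int fair_share = pcpus / nr_busy_vms;	/* 8 pCPUs / 2 busy VMs = 4 */

	if (vcpus_in_vm <= fair_share)
		return 0;
	return vcpus_in_vm - fair_share;	/* 8 vCPUs - 4 = 4 per VM */
}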

>> +Arch need to set/unset the specific vCPU in cpu_paravirt_mask. When set, avoid
>> +that vCPU and when unset, use it as usual.
>> +
>> +Scheduler will try to avoid paravirt vCPUs as much as it can.
>> +This is achieved by
>> +1. Not selecting paravirt CPU at wakeup.
>> +2. Push the task away from paravirt CPU at tick.
>> +3. Not selecting paravirt CPU at load balance.
>> +
>> +This works only for SCHED_RT and SCHED_NORMAL. SCHED_EXT and userspace can make
>> +choices accordingly using cpu_paravirt_mask.
>> +
>> +/sys/devices/system/cpu/paravirt prints the current cpu_paravirt_mask in
>> +cpulist format.
>> +
>> +Notes:
>> +1. A task pinned only on paravirt CPUs will continue to run there.
>> +2. This feature is available under CONFIG_PARAVIRT
>> +3. Refer to PowerPC for architecure implementation side.
>> +4. Doesn't push out any task running on isolated CPUs.
>>   
>>   Possible arch/ problems
>>   =======================
>> -- 
>> 2.47.3
Re: [PATCH 01/17] sched/docs: Document cpu_paravirt_mask and Paravirt CPU concept
Posted by Hillf Danton 1 week, 4 days ago
On Thu, 20 Nov 2025 20:24:13 +0530 Shrikanth Hegde wrote:
> On 11/20/25 3:18 AM, Hillf Danton wrote:
> > On Wed, 19 Nov 2025 18:14:33 +0530 Shrikanth Hegde wrote:
> >> Add documentation for new cpumask called cpu_paravirt_mask. This could
> >> help users in understanding what this mask and the concept behind it.
> >>
> >> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> >> ---
> >>   Documentation/scheduler/sched-arch.rst | 37 ++++++++++++++++++++++++++
> >>   1 file changed, 37 insertions(+)
> >>
> >> diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
> >> index ed07efea7d02..6972c295013d 100644
> >> --- a/Documentation/scheduler/sched-arch.rst
> >> +++ b/Documentation/scheduler/sched-arch.rst
> >> @@ -62,6 +62,43 @@ Your cpu_idle routines need to obey the following rules:
> >>   arch/x86/kernel/process.c has examples of both polling and
> >>   sleeping idle functions.
> >>   
> >> +Paravirt CPUs
> >> +=============
> >> +
> >> +Under virtualised environments it is possible to overcommit CPU resources.
> >> +i.e sum of virtual CPU(vCPU) of all VM's is greater than number of physical
> >> +CPUs(pCPU). Under such conditions when all or many VM's have high utilization,
> >> +hypervisor won't be able to satisfy the CPU requirement and has to context
> >> +switch within or across VM. i.e hypervisor need to preempt one vCPU to run
> >> +another. This is called vCPU preemption. This is more expensive compared to
> >> +task context switch within a vCPU.
> >> +
> > What is missing is
> > 1) vCPU preemption is X% more expensive compared to task context switch within a vCPU.
> > 
> 
> This would change from arch to arch IMO. Will try to get numbers from PowerVM hypervisor.
> 
> >> +In such cases it is better that VM's co-ordinate among themselves and ask for
> >> +less CPU by not using some of the vCPUs. Such vCPUs where workload can be
> >> +avoided at the moment for less vCPU preemption are called as "Paravirt CPUs".
> >> +Note that when the pCPU contention goes away, these vCPUs can be used again
> >> +by the workload.
> >> +
> > 2) given X, how to work out Y, the number of Paravirt CPUs for the simple
> > scenario like 8 pCPUs and 16 vCPUs (8 vCPUs from VM1, 8 vCPUs from VM2)?
> > 
> 
> Y need not be dependent on X. Note CPUs are marked as paravirt only when both VM's
> end up consuming all the CPU resource.
> 
To check that dependence, the frequency of vCPU preemption can be set to
100HZ and the frequency of task context switches within a vCPU to 250HZ,
on top of __zero__ Y (actually what we can do before this work), to compare
with the result of whatever Y this work selects.

BTW the workload on the vCPUs can be compiling the Linux kernel with -j 8.

> Different cases:
> 1. VM1 is idle and VM2 is idle - No vCPUs are marked as paravirt.
> 2. VM1 is 100% busy and VM2 is idle - No steal time seen - No vCPUs is marked as paravirt.
> 3. VM1 is idle and VM2 is 100% busy - No steal time seen - No vCPUs is marked as paravirt.
> 4. VM1 is 100% busy and VM2 is 100% busy - 50% steal time would be seen in each -
> 	Since there are only 8 pCPUs (assuming each VM1 is allocated equally), 4 vCPUs in
> 	each VM will be marked as paravirt. Workload consolidates to remaining 4 vCPUs and
> 	hence no steal time will seen. Benefit would seen since host doesn't need to change
> 	expensive VM context switches.