Add documentation for the new cpumask called cpu_paravirt_mask. This should
help users understand what this mask is and the concept behind it.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
Documentation/scheduler/sched-arch.rst | 37 ++++++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
index ed07efea7d02..6972c295013d 100644
--- a/Documentation/scheduler/sched-arch.rst
+++ b/Documentation/scheduler/sched-arch.rst
@@ -62,6 +62,43 @@ Your cpu_idle routines need to obey the following rules:
arch/x86/kernel/process.c has examples of both polling and
sleeping idle functions.
+Paravirt CPUs
+=============
+
+In virtualised environments it is possible to overcommit CPU resources,
+i.e. the sum of virtual CPUs (vCPUs) across all VMs is greater than the
+number of physical CPUs (pCPUs). Under such conditions, when all or many
+VMs have high utilization, the hypervisor cannot satisfy the aggregate
+CPU requirement and has to context switch within or across VMs, i.e. the
+hypervisor needs to preempt one vCPU to run another. This is called vCPU
+preemption, and it is considerably more expensive than a task context
+switch within a vCPU.
+
+In such cases it is better for the VMs to co-ordinate among themselves and
+ask for less CPU by not using some of their vCPUs. vCPUs on which work can
+be avoided for the moment, in order to reduce vCPU preemption, are called
+"Paravirt CPUs". Note that once the pCPU contention goes away, these vCPUs
+can be used again by the workload.
+
+The architecture needs to set/clear the specific vCPU in cpu_paravirt_mask.
+When a vCPU is set in the mask the scheduler avoids it; when it is cleared,
+the vCPU is used as usual.
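+
+For illustration, a minimal sketch of the arch side, assuming the mask can
+be written directly with the generic cpumask helpers (the hook point and
+the helper name here are hypothetical; see the PowerPC implementation for
+the real thing)::
+
+	/*
+	 * Hypothetical arch callback, invoked when the hypervisor reports
+	 * a change in pCPU contention for a given vCPU.
+	 */
+	static void arch_mark_cpu_paravirt(int cpu, bool contended)
+	{
+		if (contended)
+			/* Ask the scheduler to avoid this vCPU. */
+			cpumask_set_cpu(cpu, cpu_paravirt_mask);
+		else
+			/* Contention is gone; the vCPU is usable again. */
+			cpumask_clear_cpu(cpu, cpu_paravirt_mask);
+	}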
+
+The scheduler tries to avoid paravirt vCPUs as much as it can. This is
+achieved by the following (each of which boils down to the mask test
+sketched after the list):
+
+1. Not selecting a paravirt CPU at wakeup.
+2. Pushing tasks away from a paravirt CPU at the tick.
+3. Not selecting a paravirt CPU at load balance.
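+
+Conceptually, each of these paths reduces to a test against the mask, for
+example (a simplified sketch with a hypothetical helper name, not the
+exact mainline code)::
+
+	/* True if @cpu should be avoided because it is marked paravirt. */
+	static inline bool cpu_is_paravirt(int cpu)
+	{
+		return cpumask_test_cpu(cpu, cpu_paravirt_mask);
+	}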
+
+This works only for SCHED_NORMAL and SCHED_RT. SCHED_EXT and userspace can
+make their own choices using cpu_paravirt_mask.
+
+/sys/devices/system/cpu/paravirt prints the current cpu_paravirt_mask in
+cpulist format.
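+
+For example, on a 16-vCPU guest where vCPUs 8-15 are currently marked as
+paravirt, a read would show (the output below is illustrative)::
+
+	# cat /sys/devices/system/cpu/paravirt
+	8-15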
+
+Notes:
+
+1. A task pinned only to paravirt CPUs will continue to run there.
+2. This feature is available only under CONFIG_PARAVIRT.
+3. Refer to PowerPC for the architecture side of the implementation.
+4. Tasks running on isolated CPUs are not pushed out.
Possible arch/ problems
=======================
--
2.47.3
On Wed, 19 Nov 2025 18:14:33 +0530 Shrikanth Hegde wrote:
> Add documentation for new cpumask called cpu_paravirt_mask. This could
> help users in understanding what this mask and the concept behind it.
>
> [...]
>
> +hypervisor won't be able to satisfy the CPU requirement and has to context
> +switch within or across VM. i.e hypervisor need to preempt one vCPU to run
> +another. This is called vCPU preemption. This is more expensive compared to
> +task context switch within a vCPU.
> +
What is missing is
1) vCPU preemption is X% more expensive compared to task context switch
within a vCPU.

> +In such cases it is better that VM's co-ordinate among themselves and ask for
> +less CPU by not using some of the vCPUs. Such vCPUs where workload can be
> +avoided at the moment for less vCPU preemption are called as "Paravirt CPUs".
> +Note that when the pCPU contention goes away, these vCPUs can be used again
> +by the workload.
> +
2) given X, how to work out Y, the number of Paravirt CPUs for the simple
scenario like 8 pCPUs and 16 vCPUs (8 vCPUs from VM1, 8 vCPUs from VM2)?

> [...]
On 11/20/25 3:18 AM, Hillf Danton wrote:
> On Wed, 19 Nov 2025 18:14:33 +0530 Shrikanth Hegde wrote:
>> [...]
>
> What is missing is
> 1) vCPU preemption is X% more expensive compared to task context switch
> within a vCPU.

This would change from arch to arch IMO. Will try to get numbers from the
PowerVM hypervisor.

> 2) given X, how to work out Y, the number of Paravirt CPUs for the simple
> scenario like 8 pCPUs and 16 vCPUs (8 vCPUs from VM1, 8 vCPUs from VM2)?

Y need not be dependent on X. Note that CPUs are marked as paravirt only
when both VMs end up consuming all the CPU resource.

Different cases:

1. VM1 is idle and VM2 is idle - no vCPUs are marked as paravirt.
2. VM1 is 100% busy and VM2 is idle - no steal time is seen - no vCPUs are
marked as paravirt.
3. VM1 is idle and VM2 is 100% busy - no steal time is seen - no vCPUs are
marked as paravirt.
4. VM1 is 100% busy and VM2 is 100% busy - 50% steal time would be seen in
each. Since there are only 8 pCPUs (assuming each VM is allocated an equal
share), 4 vCPUs in each VM will be marked as paravirt. The workload
consolidates onto the remaining 4 vCPUs and hence no steal time will be
seen. The benefit comes from the host no longer needing to do expensive VM
context switches.
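
Roughly, for case 4 above, with N fully busy VMs sharing the pCPUs under
equal entitlement (an illustrative back-of-the-envelope formula, not
something the kernel computes in this exact form):

	paravirt_cpus = max(0, nr_vcpus - nr_pcpus / N)

With 8 pCPUs and two fully busy 8-vCPU VMs: 8 - 8/2 = 4 vCPUs marked as
paravirt in each VM.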
On Thu, 20 Nov 2025 20:24:13 +0530 Shrikanth Hegde wrote:
> On 11/20/25 3:18 AM, Hillf Danton wrote:
>> [...]
>>
>> 2) given X, how to work out Y, the number of Paravirt CPUs for the simple
>> scenario like 8 pCPUs and 16 vCPUs (8 vCPUs from VM1, 8 vCPUs from VM2)?
>
> Y need not be dependent on X. Note that CPUs are marked as paravirt only
> when both VMs end up consuming all the CPU resource.

To check that dependence, the frequency of vCPU preemption can be set to
100HZ and the frequency of task context switch within a vCPU to 250HZ, on
top of __zero__ Y (actually what we can do before this work), to compare
with the result of whatever Y this work selects. BTW the workload on each
vCPU can be compiling the linux kernel with -j 8.

> [...]