cpu avoid state and push task mechanism

[RFC v2 1/9] sched/docs: Document avoid_cpu_mask and avoid CPU concept

Posted by Shrikanth Hegde 7 months, 2 weeks ago

This describes what an avoid CPU is and what the scheduler aims to do
when a CPU is marked as avoid.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
 Documentation/scheduler/sched-arch.rst | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
index ed07efea7d02..d32755298fca 100644
--- a/Documentation/scheduler/sched-arch.rst
+++ b/Documentation/scheduler/sched-arch.rst
@@ -62,6 +62,31 @@ Your cpu_idle routines need to obey the following rules:
 arch/x86/kernel/process.c has examples of both polling and
 sleeping idle functions.
 
+CPU Avoid
+=========
+
+Under paravirt conditions it is possible to overcommit CPU resources,
+i.e. the sum of virtual CPUs (vCPUs) of all VMs is greater than the number of
+physical CPUs (pCPUs). Under such conditions, when all or many VMs have high
+utilization, the hypervisor cannot satisfy the demand and has to context
+switch within or across VMs. A VM-level context switch is more expensive than
+a task context switch within the VM.
+
+In such cases it is better that the VMs coordinate among themselves and
+request less CPU by not using some of their vCPUs. vCPUs on which workload
+should be avoided for the moment are called "Avoid CPUs". Note that when the
+pCPU contention goes away, these vCPUs can be used by the workload again.
+
+The arch needs to set/unset the vCPU as avoid in cpu_avoid_mask. When set, the
+scheduler avoids the CPU; when unset, the CPU is used as usual.
+
+The scheduler will try to avoid those CPUs as much as it can.
+This is achieved by:
+1. Not selecting those CPUs at wakeup.
+2. Pushing the task away from an avoid CPU at tick.
+3. Not selecting an avoid CPU at load balance.
+
+This works only for SCHED_RT and SCHED_NORMAL.
 
 Possible arch/ problems
 =======================
-- 
2.43.0
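
The three behaviours listed in the patch above (wakeup selection, push at tick, load balance) can be illustrated with a small userspace C sketch. This is not code from the series and it does not use kernel APIs: `toy_mask_t`, `pick_wakeup_cpu()`, `tick_push_target()` and `can_balance_to()` are hypothetical names, and falling back to an avoid CPU when nothing else is allowed is an assumption made to mirror the "as much as it can" wording.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 8 /* hypothetical small system */

/* Toy stand-in for a CPU mask: bit n set means CPU n is in the mask. */
typedef uint64_t toy_mask_t;

static bool mask_test(toy_mask_t mask, int cpu)
{
	return (mask >> cpu) & 1u;
}

/*
 * Wakeup: prefer an allowed CPU that is not marked avoid.
 * Assumption: if every allowed CPU is marked avoid, fall back to any
 * allowed CPU rather than failing. Returns -1 if nothing is allowed.
 */
static int pick_wakeup_cpu(toy_mask_t allowed, toy_mask_t avoid)
{
	toy_mask_t preferred = allowed & ~avoid;
	toy_mask_t pool = preferred ? preferred : allowed;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (mask_test(pool, cpu))
			return cpu;
	return -1;
}

/*
 * Tick: if the task runs on an avoid CPU and a non-avoid CPU is allowed,
 * return the CPU to push it to; otherwise the task stays where it is.
 */
static int tick_push_target(int cur, toy_mask_t allowed, toy_mask_t avoid)
{
	assert(cur >= 0 && cur < NR_CPUS);

	if (!mask_test(avoid, cur))
		return cur;
	if (!(allowed & ~avoid))
		return cur; /* nowhere better to go */
	return pick_wakeup_cpu(allowed, avoid);
}

/* Load balance: an avoid CPU is never chosen as the destination. */
static bool can_balance_to(int dst, toy_mask_t avoid)
{
	return !mask_test(avoid, dst);
}
```

With CPUs 0 and 1 marked avoid (`avoid = 0x03`) and all eight CPUs allowed, a wakeup lands on CPU 2, a task found on CPU 1 at tick is pushed to CPU 2, and a task already on CPU 4 stays put. Clearing the bits in the avoid mask restores normal placement, which matches the "when unset, use it as usual" wording in the patch.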

Re: [RFC v2 1/9] sched/docs: Document avoid_cpu_mask and avoid CPU concept

Posted by Hillf Danton 7 months, 2 weeks ago

On Thu, 26 Jun 2025 00:41:00 +0530 Shrikanth Hegde wrote
> This describes what avoid CPU means and what scheduler aims to do 
> when a CPU is marked as avoid. 
> 
> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> ---
>  Documentation/scheduler/sched-arch.rst | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
> index ed07efea7d02..d32755298fca 100644
> --- a/Documentation/scheduler/sched-arch.rst
> +++ b/Documentation/scheduler/sched-arch.rst
> @@ -62,6 +62,31 @@ Your cpu_idle routines need to obey the following rules:
>  arch/x86/kernel/process.c has examples of both polling and
>  sleeping idle functions.
>  
> +CPU Avoid
> +=========
> +
> +Under paravirt conditions it is possible to overcommit CPU resources.
> +i.e sum of virtual CPU(vCPU) of all VM is greater than number of physical
> +CPUs(pCPU). Under such conditions when all or many VM have high utilization,
> +hypervisor won't be able to satisfy the requirement and has to context switch
> +within or across VM. VM level context switch is more expensive compared to
> +task context switch within the VM.
> +
Sounds like the VMs are not well configured (or the pCPUs are not well partitioned).

> +In such cases it is better that VM's co-ordinate among themselves and ask for
> +less CPU request by not using some of the vCPUs. Such vCPUs where workload
> +can be avoided at the moment are called as "Avoid CPUs". Note that when the
> +pCPU contention goes away, these vCPUs can be used again by the workload.
> +
In the car cockpit scenario with a type-1 hypervisor, for example, there is an
app kicking the watchdog bound to every vCPU, so no vCPU should be avoided.

> +Arch need to set/unset the vCPU as avoid in cpu_avoid_mask. When set, avoid
> +the CPU and when unset, use it as usual.
> +
> +Scheduler will try to avoid those CPUs as much as it can.
> +This is achived by
> +1. Not selecting those CPU at wakeup.
> +2. Push the task away from avoid CPU at tick.
> +3. Not selecting avoid CPU at load balance.
> +
> +This works only for SCHED_RT and SCHED_NORMAL.
>  
Sounds like forcing a pill down Peter's throat because of Steve's headache.

Re: [RFC v2 1/9] sched/docs: Document avoid_cpu_mask and avoid CPU concept

Posted by Shrikanth Hegde 7 months, 2 weeks ago

Hi Hillf.

> On Thu, 26 Jun 2025 00:41:00 +0530 Shrikanth Hegde wrote
>> This describes what avoid CPU means and what scheduler aims to do
>> when a CPU is marked as avoid.
>>
>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>> ---
>>   Documentation/scheduler/sched-arch.rst | 25 +++++++++++++++++++++++++
>>   1 file changed, 25 insertions(+)
>>
>> diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
>> index ed07efea7d02..d32755298fca 100644
>> --- a/Documentation/scheduler/sched-arch.rst
>> +++ b/Documentation/scheduler/sched-arch.rst
>> @@ -62,6 +62,31 @@ Your cpu_idle routines need to obey the following rules:
>>   arch/x86/kernel/process.c has examples of both polling and
>>   sleeping idle functions.
>>   
>> +CPU Avoid
>> +=========
>> +
>> +Under paravirt conditions it is possible to overcommit CPU resources.
>> +i.e sum of virtual CPU(vCPU) of all VM is greater than number of physical
>> +CPUs(pCPU). Under such conditions when all or many VM have high utilization,
>> +hypervisor won't be able to satisfy the requirement and has to context switch
>> +within or across VM. VM level context switch is more expensive compared to
>> +task context switch within the VM.
>> +
> Sounds like VMs not well configured (or pCPUs not well partationed).

No. That is how VMs are configured in the paravirtualized case, as I understand it.
Correct me if I am wrong.

On powerpc, we have Shared Processor Logical Partitions (SPLPAR), which allow overcommit.
When other LPARs (VMs) are idle, overcommit lets one get more work done. It also allows one
to configure more VMs. The issue described above happens only when all or most VMs ask for
CPU at the same time.

> 
>> +In such cases it is better that VM's co-ordinate among themselves and ask for
>> +less CPU request by not using some of the vCPUs. Such vCPUs where workload
>> +can be avoided at the moment are called as "Avoid CPUs". Note that when the
>> +pCPU contention goes away, these vCPUs can be used again by the workload.
>> +
> In the car cockpit scenario for example with type1 hypervisor, there is app
> kicking watchdog bound to every vCPU, so no vCPU should be avoided.

I don't understand what is meant here. Any reference links? Also, in such cases,
the arch shouldn't set any CPU as avoid, though it then won't get the benefit of this feature.

> 
>> +Arch need to set/unset the vCPU as avoid in cpu_avoid_mask. When set, avoid
>> +the CPU and when unset, use it as usual.
>> +
>> +Scheduler will try to avoid those CPUs as much as it can.
>> +This is achived by
>> +1. Not selecting those CPU at wakeup.
>> +2. Push the task away from avoid CPU at tick.
>> +3. Not selecting avoid CPU at load balance.
>> +
>> +This works only for SCHED_RT and SCHED_NORMAL.
>>   
> Sounds like forcing a pill down through Peter's throat because Steve's headache.

I meant that this series so far addresses only RT and NORMAL. It could be made to work for other classes too,
but I didn't see a point.

Since the mask is available, SCHED_EXT users could design their BPF hooks accordingly, and SCHED_DL isn't designed to
work under such conditions. I don't know of any user/workload that deploys SCHED_DL in CPU-overcommitted cases.

Re: [RFC v2 1/9] sched/docs: Document avoid_cpu_mask and avoid CPU concept

Posted by Hillf Danton 7 months, 2 weeks ago

On Thu, 26 Jun 2025 20:16:36 +0530 Shrikanth Hegde wrote
> > On Thu, 26 Jun 2025 00:41:00 +0530 Shrikanth Hegde wrote
> >> This describes what avoid CPU means and what scheduler aims to do
> >> when a CPU is marked as avoid.
> >>
> >> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> >> ---
> >>   Documentation/scheduler/sched-arch.rst | 25 +++++++++++++++++++++++++
> >>   1 file changed, 25 insertions(+)
> >>
> >> diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
> >> index ed07efea7d02..d32755298fca 100644
> >> --- a/Documentation/scheduler/sched-arch.rst
> >> +++ b/Documentation/scheduler/sched-arch.rst
> >> @@ -62,6 +62,31 @@ Your cpu_idle routines need to obey the following rules:
> >>   arch/x86/kernel/process.c has examples of both polling and
> >>   sleeping idle functions.
> >>   
> >> +CPU Avoid
> >> +=========
> >> +
> >> +Under paravirt conditions it is possible to overcommit CPU resources.
> >> +i.e sum of virtual CPU(vCPU) of all VM is greater than number of physical
> >> +CPUs(pCPU). Under such conditions when all or many VM have high utilization,
> >> +hypervisor won't be able to satisfy the requirement and has to context switch
> >> +within or across VM. VM level context switch is more expensive compared to
> >> +task context switch within the VM.
> >> +
> > Sounds like VMs not well configured (or pCPUs not well partationed).
> 
> No. That's how VMs under paravirtulized case configured as i understand.
> Correct me if i am wrong.
> 
> On powerpc, we have Shared Processor Logical partitions (SPLPAR) which allows overcommit.
> When other LPAR(VM) are idle, by having overcommit one could get more work done. This allows one
> to configure more VMs too. The said issue happens only when every/most VMs ask for
> CPU at the same time.
> 
Putting virtualization aside, let's look at a simpler case where more
than 1024 apps are bound to a single CPU (on a ppc with 4 CPUs, for instance).
What can we do wrt app responsibility in the kernel? Nothing, because the
resource/budget is never enough without a sane config.

Re: [RFC v2 1/9] sched/docs: Document avoid_cpu_mask and avoid CPU concept

Posted by Shrikanth Hegde 7 months, 2 weeks ago


On 6/27/25 05:57, Hillf Danton wrote:
> On Thu, 26 Jun 2025 20:16:36 +0530 Shrikanth Hegde wrote
>>> On Thu, 26 Jun 2025 00:41:00 +0530 Shrikanth Hegde wrote
>>>> This describes what avoid CPU means and what scheduler aims to do
>>>> when a CPU is marked as avoid.
>>>>
>>>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>>>> ---
>>>>    Documentation/scheduler/sched-arch.rst | 25 +++++++++++++++++++++++++
>>>>    1 file changed, 25 insertions(+)
>>>>
>>>> diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
>>>> index ed07efea7d02..d32755298fca 100644
>>>> --- a/Documentation/scheduler/sched-arch.rst
>>>> +++ b/Documentation/scheduler/sched-arch.rst
>>>> @@ -62,6 +62,31 @@ Your cpu_idle routines need to obey the following rules:
>>>>    arch/x86/kernel/process.c has examples of both polling and
>>>>    sleeping idle functions.
>>>>    
>>>> +CPU Avoid
>>>> +=========
>>>> +
>>>> +Under paravirt conditions it is possible to overcommit CPU resources.
>>>> +i.e sum of virtual CPU(vCPU) of all VM is greater than number of physical
>>>> +CPUs(pCPU). Under such conditions when all or many VM have high utilization,
>>>> +hypervisor won't be able to satisfy the requirement and has to context switch
>>>> +within or across VM. VM level context switch is more expensive compared to
>>>> +task context switch within the VM.
>>>> +
>>> Sounds like VMs not well configured (or pCPUs not well partationed).
>>
>> No. That's how VMs under paravirtulized case configured as i understand.
>> Correct me if i am wrong.
>>
>> On powerpc, we have Shared Processor Logical partitions (SPLPAR) which allows overcommit.
>> When other LPAR(VM) are idle, by having overcommit one could get more work done. This allows one
>> to configure more VMs too. The said issue happens only when every/most VMs ask for
>> CPU at the same time.
>>
> After putting virtualization aside, lets see a simpler case where more
> than 1024 apps are bound to a single (ppc having 4 CPUs for instance) CPU,
> what can we do wrt app responsibility in kernel? 

In this case you are unlikely to have vCPU preemption; you will have
task preemption. That is OK. The patch doesn't aim to solve the case you
have mentioned above.

In a generic SPLPAR configuration, virtual processors usually have a large
number of vCPUs, and powerpc systems are fairly large in terms of CPUs as
well.

I hope that answers your question.

Re: [RFC v2 1/9] sched/docs: Document avoid_cpu_mask and avoid CPU concept

Posted by Hillf Danton 7 months, 1 week ago

On Fri, 27 Jun 2025 10:07:22 +0530 Shrikanth Hegde wrote
> On 6/27/25 05:57, Hillf Danton wrote:
> > On Thu, 26 Jun 2025 20:16:36 +0530 Shrikanth Hegde wrote
> >>> On Thu, 26 Jun 2025 00:41:00 +0530 Shrikanth Hegde wrote
> >>>> This describes what avoid CPU means and what scheduler aims to do
> >>>> when a CPU is marked as avoid.
> >>>>
> >>>> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
> >>>> ---
> >>>>    Documentation/scheduler/sched-arch.rst | 25 +++++++++++++++++++++++++
> >>>>    1 file changed, 25 insertions(+)
> >>>>
> >>>> diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
> >>>> index ed07efea7d02..d32755298fca 100644
> >>>> --- a/Documentation/scheduler/sched-arch.rst
> >>>> +++ b/Documentation/scheduler/sched-arch.rst
> >>>> @@ -62,6 +62,31 @@ Your cpu_idle routines need to obey the following rules:
> >>>>    arch/x86/kernel/process.c has examples of both polling and
> >>>>    sleeping idle functions.
> >>>>    
> >>>> +CPU Avoid
> >>>> +=========
> >>>> +
> >>>> +Under paravirt conditions it is possible to overcommit CPU resources.
> >>>> +i.e sum of virtual CPU(vCPU) of all VM is greater than number of physical
> >>>> +CPUs(pCPU). Under such conditions when all or many VM have high utilization,
> >>>> +hypervisor won't be able to satisfy the requirement and has to context switch
> >>>> +within or across VM. VM level context switch is more expensive compared to
> >>>> +task context switch within the VM.
> >>>> +
> >>> Sounds like VMs not well configured (or pCPUs not well partationed).
> >>
> >> No. That's how VMs under paravirtulized case configured as i understand.
> >> Correct me if i am wrong.
> >>
> >> On powerpc, we have Shared Processor Logical partitions (SPLPAR) which allows overcommit.
> >> When other LPAR(VM) are idle, by having overcommit one could get more work done. This allows one
> >> to configure more VMs too. The said issue happens only when every/most VMs ask for
> >> CPU at the same time.
> >>
> > After putting virtualization aside, lets see a simpler case where more
> > than 1024 apps are bound to a single (ppc having 4 CPUs for instance) CPU,
> > what can we do wrt app responsibility in kernel? 
> 
> In this case you will not likely have vCPU preemption. you will have 
> task preemption. That is ok. Patch doesn't aim to solve the case you 
> have mentioned above.
> 
It is a case of overcommit due to misconfiguration, where the scheduler does not
help simply because the kernel is not the pill that kills all pains.

> In the generic SPLPAR configuration virtual processor usually have large 
> number of vCPUs and powerpc systems are fairly large in terms of CPU as 
> well.
>
Overcommit is not specific to SPLPAR, nor to PPC, because it is buggy for the scheduler
to create overcommit on either PPC or Arm64.