Consider that the previous CPU is cache affined to the waker's CPU and
is idle. Currently, scx's default select function only selects the
previous CPU in this case if a WF_SYNC wakeup on the waker's CPU is
also requested.

This means that, without WF_SYNC, an idle previous CPU that is cache
affined to the waker is not considered. This seems extreme. WF_SYNC is
not normally passed to the wakeup path outside of some IPC drivers, but
it is very possible that the task is cache hot on the previous CPU and
shares a cache with the waker's CPU. Let's avoid too many migrations
and select the previous CPU in such cases.

This change is consistent with the fair scheduler's behavior as well. In
select_idle_sibling(), the previous CPU is selected if it is cache
affined with the target. This is done regardless of WF_SYNC and before
any scanning for fully idle cores is done.

One difference still exists between SCX and CFS in this regard: in CFS,
we first check whether the target CPU is idle before checking whether
the previous CPU is idle. However, that could be a matter of choice,
and in the future that behavior could also be unified.
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
kernel/sched/ext.c | 24 +++++++++++-------------
1 file changed, 11 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 5a81d9a1e31f..3b1a489e2aaf 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3479,7 +3479,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
{
const struct cpumask *llc_cpus = NULL;
const struct cpumask *numa_cpus = NULL;
- s32 cpu;
+ s32 cpu = smp_processor_id();
*found = false;
@@ -3507,22 +3507,20 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
llc_cpus = llc_span(prev_cpu);
}
+ /*
+ * If the waker's CPU is cache affine and prev_cpu is idle, then avoid
+ * a migration.
+ */
+ if (cpus_share_cache(cpu, prev_cpu) &&
+ test_and_clear_cpu_idle(prev_cpu)) {
+ cpu = prev_cpu;
+ goto cpu_found;
+ }
+
/*
* If WAKE_SYNC, try to migrate the wakee to the waker's CPU.
*/
if (wake_flags & SCX_WAKE_SYNC) {
- cpu = smp_processor_id();
-
- /*
- * If the waker's CPU is cache affine and prev_cpu is idle,
- * then avoid a migration.
- */
- if (cpus_share_cache(cpu, prev_cpu) &&
- test_and_clear_cpu_idle(prev_cpu)) {
- cpu = prev_cpu;
- goto cpu_found;
- }
-
/*
* If the waker's local DSQ is empty, and the system is under
* utilized, try to wake up @p to the local DSQ of the waker.
--
2.43.0
Hi Joel,
On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote:
> Consider that the previous CPU is cache affined to the waker's CPU and
> is idle. Currently, scx's default select function only selects the
> previous CPU in this case if WF_SYNC request is also made to wakeup on the
> waker's CPU.
>
> This means, without WF_SYNC, the previous CPU being cache affined to the
> waker and is idle is not considered. This seems extreme. WF_SYNC is not
> normally passed to the wakeup path outside of some IPC drivers but it is
> very possible that the task is cache hot on previous CPU and shares
> cache with the waker CPU. Lets avoid too many migrations and select the
> previous CPU in such cases.
>
> This change is consistent with the fair scheduler's behavior as well. In
> select_idle_sibling(), the previous CPU is selected if it is cache
> affined with the target. This is done regardless of WF_SYNC and before
> any scanning of fully idle cores is done.
>
> One difference still exists though between SCX and CFS in this regard, in CFS
> we first check if the target CPU is idle before checking if the previous CPU is
> idle. However that could be a matter of choice and in the future, that behavior
> could also be unified.
>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> ---
> kernel/sched/ext.c | 24 +++++++++++-------------
> 1 file changed, 11 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 5a81d9a1e31f..3b1a489e2aaf 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -3479,7 +3479,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> {
> const struct cpumask *llc_cpus = NULL;
> const struct cpumask *numa_cpus = NULL;
> - s32 cpu;
> + s32 cpu = smp_processor_id();
>
> *found = false;
>
> @@ -3507,22 +3507,20 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> llc_cpus = llc_span(prev_cpu);
> }
>
> + /*
> + * If the waker's CPU is cache affine and prev_cpu is idle, then avoid
> + * a migration.
> + */
> + if (cpus_share_cache(cpu, prev_cpu) &&
> + test_and_clear_cpu_idle(prev_cpu)) {
> + cpu = prev_cpu;
> + goto cpu_found;
> + }
> +
One potential issue that I see is that when SMT is enabled, you may end up
using prev_cpu also when the other sibling is busy. Maybe we should check
if prev_cpu is set in the SMT idle cpumask.
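Roughly something like this, on top of your change (an untested sketch;
it assumes the idle_masks.smt cpumask that ext.c already uses to track
fully idle cores, and would probably also need a sched_smt_active()
guard):

	/* Use prev_cpu only if its whole core is idle and it shares cache */
	if (cpus_share_cache(cpu, prev_cpu) &&
	    cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
	    test_and_clear_cpu_idle(prev_cpu)) {
		cpu = prev_cpu;
		goto cpu_found;
	}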
Also, last time I tried a similar change I was regressing a lot of
benchmarks. Maybe we should repeat the tests and get some numbers.
> /*
> * If WAKE_SYNC, try to migrate the wakee to the waker's CPU.
> */
> if (wake_flags & SCX_WAKE_SYNC) {
> - cpu = smp_processor_id();
> -
> - /*
> - * If the waker's CPU is cache affine and prev_cpu is idle,
> - * then avoid a migration.
> - */
> - if (cpus_share_cache(cpu, prev_cpu) &&
> - test_and_clear_cpu_idle(prev_cpu)) {
> - cpu = prev_cpu;
> - goto cpu_found;
> - }
> -
> /*
> * If the waker's local DSQ is empty, and the system is under
> * utilized, try to wake up @p to the local DSQ of the waker.
> --
> 2.43.0
>
Thanks,
-Andrea
On 3/17/2025 6:20 PM, Andrea Righi wrote:
> Hi Joel,
>
> On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote:
>> Consider that the previous CPU is cache affined to the waker's CPU and
>> is idle. Currently, scx's default select function only selects the
>> previous CPU in this case if WF_SYNC request is also made to wakeup on the
>> waker's CPU.
>>
>> This means, without WF_SYNC, the previous CPU being cache affined to the
>> waker and is idle is not considered. This seems extreme. WF_SYNC is not
>> normally passed to the wakeup path outside of some IPC drivers but it is
>> very possible that the task is cache hot on previous CPU and shares
>> cache with the waker CPU. Lets avoid too many migrations and select the
>> previous CPU in such cases.
>>
>> This change is consistent with the fair scheduler's behavior as well. In
>> select_idle_sibling(), the previous CPU is selected if it is cache
>> affined with the target. This is done regardless of WF_SYNC and before
>> any scanning of fully idle cores is done.
>>
>> One difference still exists though between SCX and CFS in this regard, in CFS
>> we first check if the target CPU is idle before checking if the previous CPU is
>> idle. However that could be a matter of choice and in the future, that behavior
>> could also be unified.
>>
>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
>> ---
>> kernel/sched/ext.c | 24 +++++++++++-------------
>> 1 file changed, 11 insertions(+), 13 deletions(-)
>>
>> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
>> index 5a81d9a1e31f..3b1a489e2aaf 100644
>> --- a/kernel/sched/ext.c
>> +++ b/kernel/sched/ext.c
>> @@ -3479,7 +3479,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>> {
>> const struct cpumask *llc_cpus = NULL;
>> const struct cpumask *numa_cpus = NULL;
>> - s32 cpu;
>> + s32 cpu = smp_processor_id();
>>
>> *found = false;
>>
>> @@ -3507,22 +3507,20 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>> llc_cpus = llc_span(prev_cpu);
>> }
>>
>> + /*
>> + * If the waker's CPU is cache affine and prev_cpu is idle, then avoid
>> + * a migration.
>> + */
>> + if (cpus_share_cache(cpu, prev_cpu) &&
>> + test_and_clear_cpu_idle(prev_cpu)) {
>> + cpu = prev_cpu;
>> + goto cpu_found;
>> + }
>> +
>
> One potential issue that I see is that when SMT is enabled, you may end up
> using prev_cpu also when the other sibling is busy. Maybe we should check
> if prev_cpu is set in the SMT idle cpumask.
Hmm, so you're suggesting selecting the previous CPU only if it is part of a
fully idle core *and* shares a cache with the waker.

That does sound reasonable to me, will look more into it, thanks.
- Joel
Hello, Joel.

On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote:
> Consider that the previous CPU is cache affined to the waker's CPU and
> is idle. Currently, scx's default select function only selects the
> previous CPU in this case if WF_SYNC request is also made to wakeup on the
> waker's CPU.
>
> This means, without WF_SYNC, the previous CPU being cache affined to the
> waker and is idle is not considered. This seems extreme. WF_SYNC is not
> normally passed to the wakeup path outside of some IPC drivers but it is
> very possible that the task is cache hot on previous CPU and shares
> cache with the waker CPU. Lets avoid too many migrations and select the
> previous CPU in such cases.

Hmm.. if !WF_SYNC:

1. If smt, if prev_cpu's core is idle, pick it. If not, try to pick an idle
   core in widening scopes.

2. If no idle core is found, pick prev_cpu if idle. If not, search for an
   idle CPU in widening scopes.

So, it is considering prev_cpu, right? I think it's preferring idle core a
bit too much - it probably doesn't make sense to cross the NUMA boundary if
there is an idle CPU in this node, at least.

Isn't the cpus_share_cache() code block mostly about not doing
waker-affining if prev_cpu of the wakee is close enough and idle, so
waker-affining is likely to be worse?

Thanks.

--
tejun
On 3/17/25 10:08, Tejun Heo wrote:
> Hello, Joel.
>
> On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote:
>> Consider that the previous CPU is cache affined to the waker's CPU and
>> is idle. Currently, scx's default select function only selects the
>> previous CPU in this case if WF_SYNC request is also made to wakeup on the
>> waker's CPU.
>>
>> This means, without WF_SYNC, the previous CPU being cache affined to the
>> waker and is idle is not considered. This seems extreme. WF_SYNC is not
>> normally passed to the wakeup path outside of some IPC drivers but it is
>> very possible that the task is cache hot on previous CPU and shares
>> cache with the waker CPU. Lets avoid too many migrations and select the
>> previous CPU in such cases.
>
> Hmm.. if !WF_SYNC:
>
> 1. If smt, if prev_cpu's core is idle, pick it. If not, try to pick an idle
>    core in widening scopes.
>
> 2. If no idle core is foudn, pick prev_cpu if idle. If not, search for an
>    idle CPU in widening scopes.
>
> So, it is considering prev_cpu, right? I think it's preferring idle core a
> bit too much - it probably doesn't make sense to cross the NUMA boundary if
> there is an idle CPU in this node, at least.

Hi Tejun,

Just as Peter said, this whole wake affinity thing is highly workload
dependent. We have seen cases where, even if there are idle cores in the
prev node, it's still beneficial to cross the NUMA boundary. Would it make
more sense to have the BPF scheduler implement/alter some of the logic here
so it can fit different workloads?

> Isn't the cpus_share_cache() code block mostly about not doing
> waker-affining if prev_cpu of the wakee is close enough and idle, so
> waker-affining is likely to be worse?
>
> Thanks.
>
On Mon, Mar 17, 2025 at 05:12:25PM -0700, Libo Chen wrote:
> Just as Peter said, this is whole wake affinity thing is highly workload
> dependent. We have seen cases where even if there are idle cores in the
> prev node, it's still beneficial to cross the NUMA boundary. Will it make
> more sense to have the BPF scheduler implement/alter some of logic here so
> it can fit to different workload?

Oh yeah, multiple SCX schedulers already implement their own select_cpu().
This discussion is just around the default implementation which can be used
if a BPF scheduler doesn't want to implement its own.

Thanks.

--
tejun
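For reference, a minimal custom ops.select_cpu() that just wraps the
built-in policy looks roughly like the sketch below (modeled on the scx
example schedulers; my_select_cpu is a made-up name, and on older kernels
scx_bpf_dsq_insert() was called scx_bpf_dispatch()):

	s32 BPF_STRUCT_OPS(my_select_cpu, struct task_struct *p,
			   s32 prev_cpu, u64 wake_flags)
	{
		bool is_idle = false;
		s32 cpu;

		/* Reuse the built-in idle-CPU selection policy. */
		cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);

		/* If an idle CPU was found, dispatch @p directly to it. */
		if (is_idle)
			scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

		return cpu;
	}

A scheduler that wants a different affinity policy can replace the
scx_bpf_select_cpu_dfl() call with its own scan.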
On 3/17/2025 6:08 PM, Tejun Heo wrote:
> Hello, Joel.
>
> On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote:
>> Consider that the previous CPU is cache affined to the waker's CPU and
>> is idle. Currently, scx's default select function only selects the
>> previous CPU in this case if WF_SYNC request is also made to wakeup on the
>> waker's CPU.
>>
>> This means, without WF_SYNC, the previous CPU being cache affined to the
>> waker and is idle is not considered. This seems extreme. WF_SYNC is not
>> normally passed to the wakeup path outside of some IPC drivers but it is
>> very possible that the task is cache hot on previous CPU and shares
>> cache with the waker CPU. Lets avoid too many migrations and select the
>> previous CPU in such cases.
>
> Hmm.. if !WF_SYNC:
>
> 1. If smt, if prev_cpu's core is idle, pick it. If not, try to pick an idle
> core in widening scopes.
>
> 2. If no idle core is foudn, pick prev_cpu if idle. If not, search for an
> idle CPU in widening scopes.
>
> So, it is considering prev_cpu, right?
Yes, but unlike the FAIR scheduler, it is not selecting the previous CPU when it
is an idle SMT sibling of a busy core, and is instead looking for a fully idle
core first. I saw that as a difference between what FAIR and EXT do.
> I think it's preferring idle core a
> bit too much - it probably doesn't make sense to cross the NUMA boundary if
> there is an idle CPU in this node, at least.
>
Yes, that is a bit extreme. I believe that is what my patch is fixing. If
previous CPU and current CPU share cache, we prefer busy cores with free
previous idle SMT, otherwise go looking for fully idle cores. But it sounds like
from Peter's reply that is not necessarily a good thing to do in the 'fast
numa' case. So I guess there is no good answer (?) or way of doing it.
I guess one of the things I would like is for FAIR's and EXT's selection
algorithms (the default one, in EXT's case) to be close in behavior and
consistent. And maybe we can unify their behaviors in this regard, and 'do the
right thing' for the majority of workloads and machines.
> Isn't the cpus_share_cache() code block mostly about not doing
> waker-affining if prev_cpu of the wakee is close enough and idle, so
> waker-affining is likely to be worse?
It seems with FAIR, we do waker-affining first:
In select_idle_sibling(...):
	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
	    asym_fits_cpu(task_util, util_min, util_max, target))
		return target;
before checking if previous CPU shares cache:
In select_idle_sibling(...):
	/*
	 * If the previous CPU is cache affine and idle, don't be stupid:
	 */
	if (prev != target && cpus_share_cache(prev, target) &&
	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
	    asym_fits_cpu(task_util, util_min, util_max, prev)) {

		if (!static_branch_unlikely(&sched_cluster_active) ||
		    cpus_share_resources(prev, target))
			return prev;

		prev_aff = prev;
	}
So I think that block is not about "not doing waker affining" but about "doing
wakee affining".
thanks,
- Joel
Hello,

On Mon, Mar 17, 2025 at 11:07:36PM +0100, Joel Fernandes wrote:
...
> > I think it's preferring idle core a
> > bit too much - it probably doesn't make sense to cross the NUMA boundary if
> > there is an idle CPU in this node, at least.
>
> Yes, that is a bit extreme. I believe that is what my patch is fixing. If
> previous CPU and current CPU share cache, we prefer busy cores with free
> previous idle SMT, otherwise go looking for fully idle cores. But it sounds like
> from Peter's reply that is not necessarily a good thing to do in case 'fast
> numa'. So I guess there is no good answer (?) or way of doing it.

Yeah, recent AMD CPUs can be configured into NUMA mode where each chiplet is
reported as a node, and they do behave like one as each has its own memory
connection, but the distances among them are significantly closer than in
traditional multi-socket systems. That said, I think the implementation is
still a bit too happy to jump the boundary. What Andrea is suggesting seems
reasonable?

Thanks.

--
tejun
On 3/17/2025 11:25 PM, Tejun Heo wrote:
> Hello,
>
> On Mon, Mar 17, 2025 at 11:07:36PM +0100, Joel Fernandes wrote:
> ...
>>> I think it's preferring idle core a
>>> bit too much - it probably doesn't make sense to cross the NUMA boundary if
>>> there is an idle CPU in this node, at least.
>>
>> Yes, that is a bit extreme. I believe that is what my patch is fixing. If
>> previous CPU and current CPU share cache, we prefer busy cores with free
>> previous idle SMT, otherwise go looking for fully idle cores. But it sounds like
>> from Peter's reply that is not necessarily a good thing to do in case 'fast
>> numa'. So I guess there is no good answer (?) or way of doing it.
>
> Yeah, recent AMD CPUs can be configured into NUMA mode where each chiplet is
> reported as a node and they do behave like one as each has its own memory
> connection but the distances among them are significantly closer than
> traditional multi-socket. That said, I think the implementation is still a
> bit too happy to jump the boundary. What Andrea is suggesting seems
> reasonable?

Thanks for sharing the AMD configurations. Yeah, agreed on Andrea's
suggestion.

- Joel
On Mon, Mar 17, 2025 at 07:08:15AM -1000, Tejun Heo wrote:
> Hello, Joel.
>
> On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote:
> > Consider that the previous CPU is cache affined to the waker's CPU and
> > is idle. Currently, scx's default select function only selects the
> > previous CPU in this case if WF_SYNC request is also made to wakeup on the
> > waker's CPU.
> >
> > This means, without WF_SYNC, the previous CPU being cache affined to the
> > waker and is idle is not considered. This seems extreme. WF_SYNC is not
> > normally passed to the wakeup path outside of some IPC drivers but it is
> > very possible that the task is cache hot on previous CPU and shares
> > cache with the waker CPU. Lets avoid too many migrations and select the
> > previous CPU in such cases.
>
> Hmm.. if !WF_SYNC:
>
> 1. If smt, if prev_cpu's core is idle, pick it. If not, try to pick an idle
>    core in widening scopes.
>
> 2. If no idle core is foudn, pick prev_cpu if idle. If not, search for an
>    idle CPU in widening scopes.
>
> So, it is considering prev_cpu, right? I think it's preferring idle core a
> bit too much - it probably doesn't make sense to cross the NUMA boundary if
> there is an idle CPU in this node, at least.

Yeah, we should probably be a bit more conservative by default and avoid
jumping across nodes if there are still idle CPUs within the node.

With the new scx_bpf_select_cpu_and() API [1] it'll be easier to enforce
that while still using the built-in idle policy (since we can specify idle
flags), but that doesn't preclude adjusting the default policy anyway, if
it makes more sense.

I guess the question is: what is more expensive in general on task wakeup?
1) a cross-node migration or 2) running on a partially busy SMT core?

-Andrea

[1] https://lore.kernel.org/all/20250314094827.167563-1-arighi@nvidia.com/

> Isn't the cpus_share_cache() code block mostly about not doing
> waker-affining if prev_cpu of the wakee is close enough and idle, so
> waker-affining is likely to be worse?
>
> Thanks.
>
> --
> tejun
On 3/17/2025 6:30 PM, Andrea Righi wrote:
> On Mon, Mar 17, 2025 at 07:08:15AM -1000, Tejun Heo wrote:
>> Hello, Joel.
>>
>> On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote:
>>> Consider that the previous CPU is cache affined to the waker's CPU and
>>> is idle. Currently, scx's default select function only selects the
>>> previous CPU in this case if WF_SYNC request is also made to wakeup on the
>>> waker's CPU.
>>>
>>> This means, without WF_SYNC, the previous CPU being cache affined to the
>>> waker and is idle is not considered. This seems extreme. WF_SYNC is not
>>> normally passed to the wakeup path outside of some IPC drivers but it is
>>> very possible that the task is cache hot on previous CPU and shares
>>> cache with the waker CPU. Lets avoid too many migrations and select the
>>> previous CPU in such cases.
>> Hmm.. if !WF_SYNC:
>>
>> 1. If smt, if prev_cpu's core is idle, pick it. If not, try to pick an idle
>>    core in widening scopes.
>>
>> 2. If no idle core is foudn, pick prev_cpu if idle. If not, search for an
>>    idle CPU in widening scopes.
>>
>> So, it is considering prev_cpu, right? I think it's preferring idle core a
>> bit too much - it probably doesn't make sense to cross the NUMA boundary if
>> there is an idle CPU in this node, at least.
>
> Yeah, we should probably be a bit more conservative by default and avoid
> jumping across nodes if there are still idle CPUs within the node.

Agreed. So maybe we check for fully idle cores *within the node* first, before
preferring idle SMTs *within the node*? And then, as a next step, go looking at
other nodes. Would that be a reasonable middle ground?

> With the new scx_bpf_select_cpu_and() API [1] it'll be easier to enforce
> that while still using the built-in idle policy (since we can specify idle
> flags), but that doesn't preclude adjusting the default policy anyway, if
> it makes more sense.

Aren't you deprecating the usage of the default select function? If we are going
to be adjusting its behavior like my patch is doing, then we should probably not
also deprecate it.

thanks,

- Joel
On Mon, Mar 17, 2025 at 11:11:08PM +0100, Joel Fernandes wrote:
>
>
> On 3/17/2025 6:30 PM, Andrea Righi wrote:
> > On Mon, Mar 17, 2025 at 07:08:15AM -1000, Tejun Heo wrote:
> >> Hello, Joel.
> >>
> >> On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote:
> >>> Consider that the previous CPU is cache affined to the waker's CPU and
> >>> is idle. Currently, scx's default select function only selects the
> >>> previous CPU in this case if WF_SYNC request is also made to wakeup on the
> >>> waker's CPU.
> >>>
> >>> This means, without WF_SYNC, the previous CPU being cache affined to the
> >>> waker and is idle is not considered. This seems extreme. WF_SYNC is not
> >>> normally passed to the wakeup path outside of some IPC drivers but it is
> >>> very possible that the task is cache hot on previous CPU and shares
> >>> cache with the waker CPU. Lets avoid too many migrations and select the
> >>> previous CPU in such cases.
> >> Hmm.. if !WF_SYNC:
> >>
> >> 1. If smt, if prev_cpu's core is idle, pick it. If not, try to pick an idle
> >> core in widening scopes.
> >>
> >> 2. If no idle core is foudn, pick prev_cpu if idle. If not, search for an
> >> idle CPU in widening scopes.
> >>
> >> So, it is considering prev_cpu, right? I think it's preferring idle core a
> >> bit too much - it probably doesn't make sense to cross the NUMA boundary if
> >> there is an idle CPU in this node, at least.
> >
> > Yeah, we should probably be a bit more conservative by default and avoid
> > jumping across nodes if there are still idle CPUs within the node.
> >
>
> Agreed. So maybe we check for fully idle cores *within the node* first, before
> preferring idle SMTs *within the node* ? And then, as next step go looking at
> other nodes. Would that be a reasonable middle ground?
>
> > With the new scx_bpf_select_cpu_and() API [1] it'll be easier to enforce
> > that while still using the built-in idle policy (since we can specify idle
> > flags), but that doesn't preclude adjusting the default policy anyway, if
> > it makes more sense.
>
> Aren't you deprecating the usage of the default select function? If we are going
> to be adjusting its behavior like my patch is doing, then we should probably not
> also deprecate it.
I'm just extending the default select function to accept a cpumask and idle
SCX_PICK_IDLE_* flags, so that it's easier for BPF schedulers to change the
select behavior without reimplementing the whole thing.
The old scx_bpf_select_cpu_dfl() will be remapped to the new API for a
while for backward compatibility and the underlying selection logic remains
the same.
So, in this case for example, you could implement the "check full-idle then
partial-idle SMT CPUs within the node" logic as follows:

	/* Search for full-idle SMT first, then idle CPUs within prev_cpu's node */
	cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags,
				     p->cpus_ptr, SCX_PICK_IDLE_IN_NODE);
	if (cpu < 0) {
		/* Search for full-idle SMT first, then idle CPUs across all nodes */
		cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, p->cpus_ptr, 0);
	}
-Andrea
On Tue, Mar 18, 2025 at 06:09:42AM +0100, Andrea Righi wrote:
> On Mon, Mar 17, 2025 at 11:11:08PM +0100, Joel Fernandes wrote:
> >
> >
> > On 3/17/2025 6:30 PM, Andrea Righi wrote:
> > > On Mon, Mar 17, 2025 at 07:08:15AM -1000, Tejun Heo wrote:
> > >> Hello, Joel.
> > >>
> > >> On Mon, Mar 17, 2025 at 04:28:02AM -0400, Joel Fernandes wrote:
> > >>> Consider that the previous CPU is cache affined to the waker's CPU and
> > >>> is idle. Currently, scx's default select function only selects the
> > >>> previous CPU in this case if WF_SYNC request is also made to wakeup on the
> > >>> waker's CPU.
> > >>>
> > >>> This means, without WF_SYNC, the previous CPU being cache affined to the
> > >>> waker and is idle is not considered. This seems extreme. WF_SYNC is not
> > >>> normally passed to the wakeup path outside of some IPC drivers but it is
> > >>> very possible that the task is cache hot on previous CPU and shares
> > >>> cache with the waker CPU. Lets avoid too many migrations and select the
> > >>> previous CPU in such cases.
> > >> Hmm.. if !WF_SYNC:
> > >>
> > >> 1. If smt, if prev_cpu's core is idle, pick it. If not, try to pick an idle
> > >> core in widening scopes.
> > >>
> > >> 2. If no idle core is foudn, pick prev_cpu if idle. If not, search for an
> > >> idle CPU in widening scopes.
> > >>
> > >> So, it is considering prev_cpu, right? I think it's preferring idle core a
> > >> bit too much - it probably doesn't make sense to cross the NUMA boundary if
> > >> there is an idle CPU in this node, at least.
> > >
> > > Yeah, we should probably be a bit more conservative by default and avoid
> > > jumping across nodes if there are still idle CPUs within the node.
> > >
> >
> > Agreed. So maybe we check for fully idle cores *within the node* first, before
> > preferring idle SMTs *within the node* ? And then, as next step go looking at
> > other nodes. Would that be a reasonable middle ground?
> >
> > > With the new scx_bpf_select_cpu_and() API [1] it'll be easier to enforce
> > > that while still using the built-in idle policy (since we can specify idle
> > > flags), but that doesn't preclude adjusting the default policy anyway, if
> > > it makes more sense.
> >
> > Aren't you deprecating the usage of the default select function? If we are going
> > to be adjusting its behavior like my patch is doing, then we should probably not
> > also deprecate it.
>
> I'm just extending the default select function to accept a cpumask and idle
> SCX_PICK_IDLE_* flags, so that it's easier for BPF schedulers to change the
> select behavior without reimplementing the whole thing.
>
> The old scx_bpf_select_cpu_dfl() will be remapped to the new API for a
> while for backward compatibility and the underlying selection logic remains
> the same.
>
> So, in this case for example, you could implement the "check full-idle then
> partial-idle SMT CPUs within the node" logic as following:
>
> /* Search for full-idle SMT first, then idle CPUs within prev_cpu's node */
> cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags,
> p->cpus_ptr, SCX_PICK_IDLE_IN_NODE)
> if (cpu < 0) {
> /* Search for full-idle SMT first, then idle CPUs across all nodes */
> cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, p->cpus_ptr, 0)
> }
Thanks, Andrea! I adjusted the default selection as below, hope it looks good
now, will test it more as well. Let me know any comments.
----------------8<-----
From: Joel Fernandes <joelagnelf@nvidia.com>
Subject: [PATCH] sched/ext: Make default idle CPU selection better
Currently, sched_ext's default CPU selection is roughly something like
this:
1. Look for FULLY IDLE CORES:
1.1. Select prev CPU (wakee) if its CORE is fully idle.
1.2. Or, pick any CPU from fully idle CORE in the L3, then NUMA.
1.3. Or, any idle CPU from fully idle CORE usable by task.
2. Or, use PREV CPU if it is idle.
3. Or any idle CPU in the LLC, NUMA.
4. Or finally any CPU usable by the task.
This can end up selecting any idle core in the system, even if that means
jumping across NUMA nodes (basically, 1.3 happens before 3.).

Improve this by moving 1.3 to after 3 (so that skipping over NUMA
happens only later), and also add selection of the fully idle target (waker)
core before looking for fully idle cores in the LLC/NUMA. This is similar to
what the FAIR scheduler does.
The new sequence is as follows:
1. Look for FULLY IDLE CORES:
1.1. Select prev CPU (wakee) if its CORE is fully idle.
1.2. Select target CPU (waker) if its CORE is fully idle and shares cache
with prev. <- Added this.
1.3. Or, pick any CPU from fully idle CORE in the L3, then NUMA.
2. Or, use PREV CPU if it is idle.
3. Or any idle CPU in the LLC, NUMA.
4. Or, any idle CPU from fully idle CORE usable by task. <- Moved down.
5. Or finally any CPU usable by the task.
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
kernel/sched/ext.c | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 5a81d9a1e31f..324e442319c7 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3558,6 +3558,16 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
goto cpu_found;
}
+ /*
+ * If the waker's CPU shares cache with @prev_cpu and is part
+ * of a fully idle core, select it.
+ */
+ if (cpus_share_cache(cpu, prev_cpu) &&
+ cpumask_test_cpu(cpu, idle_masks.smt) &&
+ test_and_clear_cpu_idle(cpu)) {
+ goto cpu_found;
+ }
+
/*
* Search for any fully idle core in the same LLC domain.
*/
@@ -3575,13 +3585,6 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
if (cpu >= 0)
goto cpu_found;
}
-
- /*
- * Search for any full idle core usable by the task.
- */
- cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
- if (cpu >= 0)
- goto cpu_found;
}
/*
@@ -3610,6 +3613,15 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
goto cpu_found;
}
+ /*
+ * Search for any full idle core usable by the task.
+ */
+ if (sched_smt_active()) {
+ cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
+ if (cpu >= 0)
+ goto cpu_found;
+ }
+
/*
* Search for any idle CPU usable by the task.
*/
--
2.43.0
Hi Joel,
On Tue, Mar 18, 2025 at 01:00:47PM -0400, Joel Fernandes wrote:
...
> From: Joel Fernandes <joelagnelf@nvidia.com>
> Subject: [PATCH] sched/ext: Make default idle CPU selection better
>
> Currently, sched_ext's default CPU selection is roughly something like
> this:
>
> 1. Look for FULLY IDLE CORES:
> 1.1. Select prev CPU (wakee) if its CORE is fully idle.
> 1.2. Or, pick any CPU from fully idle CORE in the L3, then NUMA.
> 1.3. Or, any idle CPU from fully idle CORE usable by task.
> 2. Or, use PREV CPU if it is idle.
> 3. Or any idle CPU in the LLC, NUMA.
> 4. Or finally any CPU usable by the task.
>
> This can end up select any idle core in the system even if that means
> jumping across NUMA nodes (basically 1.3 happens before 3.).
>
> Improve this by moving 1.3 to after 3 (so that skipping over NUMA
> happens only later) and also add selection of fully idle target (waker)
> core before looking for fully-idle cores in the LLC/NUMA. This is similar to
> what FAIR scheduler does.
>
> The new sequence is as follows:
>
> 1. Look for FULLY IDLE CORES:
> 1.1. Select prev CPU (wakee) if its CORE is fully idle.
> 1.2. Select target CPU (waker) if its CORE is fully idle and shares cache
> with prev. <- Added this.
> 1.3. Or, pick any CPU from fully idle CORE in the L3, then NUMA.
> 2. Or, use PREV CPU if it is idle.
> 3. Or any idle CPU in the LLC, NUMA.
> 4. Or, any idle CPU from fully idle CORE usable by task. <- Moved down.
> 5. Or finally any CPU usable by the task.
>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> ---
> kernel/sched/ext.c | 26 +++++++++++++++++++-------
> 1 file changed, 19 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 5a81d9a1e31f..324e442319c7 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -3558,6 +3558,16 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> goto cpu_found;
> }
>
> + /*
> + * If the waker's CPU shares cache with @prev_cpu and is part
> + * of a fully idle core, select it.
> + */
> + if (cpus_share_cache(cpu, prev_cpu) &&
> + cpumask_test_cpu(cpu, idle_masks.smt) &&
> + test_and_clear_cpu_idle(cpu)) {
I think this is always false, because cpu is still in use by the waker and
its state hasn't been updated to idle yet.
> + goto cpu_found;
> + }
> +
-Andrea
On 3/18/2025 6:46 PM, Andrea Righi wrote:
> Hi Joel,
>
> On Tue, Mar 18, 2025 at 01:00:47PM -0400, Joel Fernandes wrote:
> ...
>> From: Joel Fernandes <joelagnelf@nvidia.com>
>> Subject: [PATCH] sched/ext: Make default idle CPU selection better
>>
>> Currently, sched_ext's default CPU selection is roughly something like
>> this:
>>
>> 1. Look for FULLY IDLE CORES:
>> 1.1. Select prev CPU (wakee) if its CORE is fully idle.
>> 1.2. Or, pick any CPU from fully idle CORE in the L3, then NUMA.
>> 1.3. Or, any idle CPU from fully idle CORE usable by task.
>> 2. Or, use PREV CPU if it is idle.
>> 3. Or any idle CPU in the LLC, NUMA.
>> 4. Or finally any CPU usable by the task.
>>
>> This can end up select any idle core in the system even if that means
>> jumping across NUMA nodes (basically 1.3 happens before 3.).
>>
>> Improve this by moving 1.3 to after 3 (so that skipping over NUMA
>> happens only later) and also add selection of fully idle target (waker)
>> core before looking for fully-idle cores in the LLC/NUMA. This is similar to
>> what FAIR scheduler does.
>>
>> The new sequence is as follows:
>>
>> 1. Look for FULLY IDLE CORES:
>> 1.1. Select prev CPU (wakee) if its CORE is fully idle.
>> 1.2. Select target CPU (waker) if its CORE is fully idle and shares cache
>> with prev. <- Added this.
>> 1.3. Or, pick any CPU from fully idle CORE in the L3, then NUMA.
>> 2. Or, use PREV CPU if it is idle.
>> 3. Or any idle CPU in the LLC, NUMA.
>> 4. Or, any idle CPU from fully idle CORE usable by task. <- Moved down.
>> 5. Or finally any CPU usable by the task.
>>
>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
>> ---
>> kernel/sched/ext.c | 26 +++++++++++++++++++-------
>> 1 file changed, 19 insertions(+), 7 deletions(-)
>>
>> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
>> index 5a81d9a1e31f..324e442319c7 100644
>> --- a/kernel/sched/ext.c
>> +++ b/kernel/sched/ext.c
>> @@ -3558,6 +3558,16 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>> goto cpu_found;
>> }
>>
>> + /*
>> + * If the waker's CPU shares cache with @prev_cpu and is part
>> + * of a fully idle core, select it.
>> + */
>> + if (cpus_share_cache(cpu, prev_cpu) &&
>> + cpumask_test_cpu(cpu, idle_masks.smt) &&
>> + test_and_clear_cpu_idle(cpu)) {
>
> I think this is always false, because cpu is still in use by the waker and
> its state hasn't been updated to idle yet.
>
Summarizing our in-person conference meeting: we verified that, in the case of
wakeups happening from an IRQ, the waker could be in the idle state at the time
of select. So I'll keep the change and will rebase and send it out again soon.
thanks,
- Joel
On Mon, Mar 17, 2025 at 06:30:57PM +0100, Andrea Righi wrote:

> I guess the question is: what is more expensive in general on task wakeup?
> 1) a cross-node migration or 2) running on a partially busy SMT core?

That totally depends on both the workload and the actual machine :/

If you have 'fast' numa and not a very big footprint, the numa
migrations aren't too bad. OTOH if you have sucky numa or your memory
footprint is significant, then running on the wrong node is super
painful.
On Mon, Mar 17, 2025 at 06:44:31PM +0100, Peter Zijlstra wrote:
> On Mon, Mar 17, 2025 at 06:30:57PM +0100, Andrea Righi wrote:
>
> > I guess the question is: what is more expensive in general on task wakeup?
> > 1) a cross-node migration or 2) running on a partially busy SMT core?
>
> That totally depends on both the workload and the actual machine :/
>
> If you have 'fast' numa and not a very big footprint, the numa
> migrations aren't too bad. OTOH if you have sucky numa or your memory
> footprint is significant, then running on the wrong node is super
> painful.

Right, there isn't a single "best solution" in general, I guess we just need
to pick what we think works best in most cases, set that as default, and
then leave it to the BPF schedulers to adjust the behavior as needed (at the
end this is just a default policy).

Thanks,
-Andrea