When the LLC and NUMA domains fully overlap, enabling both optimizations
in the built-in idle CPU selection policy is redundant, as it leads to
searching for an idle CPU within the same domain twice.
Likewise, if all online CPUs are within a single LLC domain, LLC
optimization is unnecessary.
Therefore, detect overlapping domains and enable topology optimizations
only when necessary.
Fixes: 860a45219bce ("sched_ext: Introduce NUMA awareness to the default idle selection policy")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 23 ++++++++++++++++++-----
1 file changed, 18 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index fc7f15eefe54..82acbaffd5a7 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3140,7 +3140,7 @@ static void update_selcpu_topology(void)
 {
 	bool enable_llc = false, enable_numa = false;
 	struct sched_domain *sd;
-	const struct cpumask *cpus;
+	const struct cpumask *llc_cpus = NULL, *numa_cpus = NULL;
 	s32 cpu = cpumask_first(cpu_online_mask);
 
 	/*
@@ -3154,16 +3154,29 @@ static void update_selcpu_topology(void)
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_llc, cpu));
 	if (sd) {
-		cpus = sched_domain_span(sd);
-		if (cpumask_weight(cpus) < num_possible_cpus())
+		llc_cpus = sched_domain_span(sd);
+		if (cpumask_weight(llc_cpus) < num_possible_cpus())
 			enable_llc = true;
 	}
 	sd = highest_flag_domain(cpu, SD_NUMA);
 	if (sd) {
-		cpus = sched_group_span(sd->groups);
-		if (cpumask_weight(cpus) < num_possible_cpus())
+		numa_cpus = sched_group_span(sd->groups);
+		if (cpumask_weight(numa_cpus) < num_possible_cpus())
 			enable_numa = true;
 	}
+	/*
+	 * If the NUMA domain perfectly overlaps with the LLC domain, enable
+	 * LLC optimization only, as checking for an idle CPU in the same
+	 * domain twice is redundant.
+	 */
+	if (enable_numa && enable_llc && cpumask_equal(numa_cpus, llc_cpus))
+		enable_numa = false;
+	/*
+	 * If all the online CPUs are in the same LLC domain, there is no
+	 * reason to enable LLC optimization.
+	 */
+	if (enable_llc && cpumask_equal(llc_cpus, cpu_online_mask))
+		enable_llc = false;
 	rcu_read_unlock();
 
 	pr_debug("sched_ext: LLC idle selection %s\n",
--
2.47.0
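
To see why a full overlap makes the second scan redundant, consider how the
two flags are consumed. The following is a condensed sketch of the built-in
idle CPU selection cascade, not the literal kernel code: scx_pick_idle_cpu()
is the real helper in kernel/sched/ext.c, but the surrounding function shape
and names here are simplified for illustration.

/*
 * Illustrative sketch only: search for an idle CPU in progressively
 * wider masks. If llc_cpus and numa_cpus are equal, the second
 * scx_pick_idle_cpu() call scans exactly the same CPUs as the first,
 * which is the redundancy the patch above avoids.
 */
static s32 pick_idle_cpu_sketch(struct task_struct *p,
				const struct cpumask *llc_cpus,
				const struct cpumask *numa_cpus)
{
	s32 cpu;

	/* First, any idle CPU sharing the LLC, if LLC optimization is on. */
	if (llc_cpus) {
		cpu = scx_pick_idle_cpu(llc_cpus, 0);
		if (cpu >= 0)
			return cpu;
	}

	/* Then, any idle CPU in the same NUMA node, if enabled. */
	if (numa_cpus) {
		cpu = scx_pick_idle_cpu(numa_cpus, 0);
		if (cpu >= 0)
			return cpu;
	}

	/* Finally, anywhere the task is allowed to run. */
	return scx_pick_idle_cpu(p->cpus_ptr, 0);
}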
Hello, Andrea.

Sorry about the delay.

On Thu, Oct 31, 2024 at 08:34:01AM +0100, Andrea Righi wrote:
> @@ -3154,16 +3154,29 @@ static void update_selcpu_topology(void)
> 	rcu_read_lock();
> 	sd = rcu_dereference(per_cpu(sd_llc, cpu));
> 	if (sd) {
> -		cpus = sched_domain_span(sd);
> -		if (cpumask_weight(cpus) < num_possible_cpus())
> +		llc_cpus = sched_domain_span(sd);
> +		if (cpumask_weight(llc_cpus) < num_possible_cpus())

Not from this patch but should the weight be compared against
num_online_cpus()? Sched domains don't include offline CPUs, right?

...

> +	/*
> +	 * If the NUMA domain perfectly overlaps with the LLC domain, enable
> +	 * LLC optimization only, as checking for an idle CPU in the same
> +	 * domain twice is redundant.
> +	 */
> +	if (enable_numa && enable_llc && cpumask_equal(numa_cpus, llc_cpus))
> +		enable_numa = false;
> +	/*
> +	 * If all the online CPUs are in the same LLC domain, there is no
> +	 * reason to enable LLC optimization.
> +	 */
> +	if (enable_llc && cpumask_equal(llc_cpus, cpu_online_mask))
> +		enable_llc = false;

The second test looks like it should always be correct. I'm not sure the
first one is. It probably is good enough but would at least be worthwhile to
explain why that is?

Thanks.

--
tejun
On Tue, Nov 05, 2024 at 01:32:05PM -1000, Tejun Heo wrote:
> Hello, Andrea.
>
> Sorry about the delay.
>
> On Thu, Oct 31, 2024 at 08:34:01AM +0100, Andrea Righi wrote:
> > @@ -3154,16 +3154,29 @@ static void update_selcpu_topology(void)
> > 	rcu_read_lock();
> > 	sd = rcu_dereference(per_cpu(sd_llc, cpu));
> > 	if (sd) {
> > -		cpus = sched_domain_span(sd);
> > -		if (cpumask_weight(cpus) < num_possible_cpus())
> > +		llc_cpus = sched_domain_span(sd);
> > +		if (cpumask_weight(llc_cpus) < num_possible_cpus())
>
> Not from this patch but should the weight be compared against
> num_online_cpus()? Sched domains don't include offline CPUs, right?

That's right, the sched domain is definitely updated when a CPU goes
offline (I've just verified), so we should check for num_online_cpus()
here.

Thanks for noticing it!

> ...
> > +	/*
> > +	 * If the NUMA domain perfectly overlaps with the LLC domain, enable
> > +	 * LLC optimization only, as checking for an idle CPU in the same
> > +	 * domain twice is redundant.
> > +	 */
> > +	if (enable_numa && enable_llc && cpumask_equal(numa_cpus, llc_cpus))
> > +		enable_numa = false;
> > +	/*
> > +	 * If all the online CPUs are in the same LLC domain, there is no
> > +	 * reason to enable LLC optimization.
> > +	 */
> > +	if (enable_llc && cpumask_equal(llc_cpus, cpu_online_mask))
> > +		enable_llc = false;
>
> The second test looks like it should always be correct. I'm not sure the
> first one is. It probably is good enough but would at least be worthwhile to
> explain why that is?

Let's say we have 2 NUMA nodes, each with 2 sockets, and each socket
has its own L3 cache. In this case, numa_cpus will be larger than
llc_cpus, and enabling both NUMA and LLC optimizations would be
beneficial.

On the other hand, if each NUMA node contains only 1 socket, numa_cpus
and llc_cpus will overlap completely, making it unnecessary to enable
both NUMA and LLC optimizations, so we can have just the LLC in this
case.

Would something like this help clarifying the first test?

Thanks,
-Andrea
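
For concreteness, the adjustment agreed on above would look roughly like
this fragment of update_selcpu_topology() (a sketch using the patch's own
variables; the actual follow-up revision may differ):

	sd = rcu_dereference(per_cpu(sd_llc, cpu));
	if (sd) {
		llc_cpus = sched_domain_span(sd);
		/*
		 * Sched domains are rebuilt on hotplug and never span
		 * offline CPUs, so the weight must be compared against
		 * the online count, not the possible count.
		 */
		if (cpumask_weight(llc_cpus) < num_online_cpus())
			enable_llc = true;
	}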
Hello,

On Wed, Nov 06, 2024 at 01:29:08AM +0100, Andrea Righi wrote:
...
> Let's say we have 2 NUMA nodes, each with 2 sockets, and each socket
> has its own L3 cache. In this case, numa_cpus will be larger than
> llc_cpus, and enabling both NUMA and LLC optimizations would be
> beneficial.
>
> On the other hand, if each NUMA node contains only 1 socket, numa_cpus
> and llc_cpus will overlap completely, making it unnecessary to enable
> both NUMA and LLC optimizations, so we can have just the LLC in this
> case.
>
> Would something like this help clarifying the first test?

I was more thinking about the theoretical case where one socket has one LLC
while a different socket has multiple LLCs. I don't think there are any
systems which are actually like that but there's nothing in the code which
prevents that (unlike a single CPU belonging to multiple domains), so it'd
probably be worthwhile to explain why the abbreviated test is enough.

Thanks.

--
tejun
On Tue, Nov 05, 2024 at 02:33:53PM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Nov 06, 2024 at 01:29:08AM +0100, Andrea Righi wrote:
> ...
> > Let's say we have 2 NUMA nodes, each with 2 sockets, and each socket
> > has its own L3 cache. In this case, numa_cpus will be larger than
> > llc_cpus, and enabling both NUMA and LLC optimizations would be
> > beneficial.
> >
> > On the other hand, if each NUMA node contains only 1 socket, numa_cpus
> > and llc_cpus will overlap completely, making it unnecessary to enable
> > both NUMA and LLC optimizations, so we can have just the LLC in this
> > case.
> >
> > Would something like this help clarifying the first test?
>
> I was more thinking about the theoretical case where one socket has one LLC
> while a different socket has multiple LLCs. I don't think there are any
> systems which are actually like that but there's nothing in the code which
> prevents that (unlike a single CPU belonging to multiple domains), so it'd
> probably be worthwhile to explain why the abbreviated test is enough.

In theory a CPU can only belong to a single domain (otherwise other
stuff in topology.c are broken as well), but potentially we could have
something like:

 NUMA 1
   - CPU 1 (L3)
 NUMA 2
   - CPU 2 (L3)
   - CPU 3 (L3)

If we inspect CPU 1 only we may incorrectly assume that numa_cpus ==
llc_cpus. To handle this properly we may have to inspect all the CPUs,
instead of just the first one.

Moreover, with qemu we can also simulate ugly topologies like 2 NUMA
nodes and 1 L3 cache that covers the 2 NUMA nodes:

arighi@gpd3~/s/linux (master)> vng --cpu 4 -m 4G --numa 2G,cpus=0-1 --numa 2G,cpus=2-3
...
arighi@virtme-ng~/s/linux (master)> lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
  0    0      0    0 0:0:0:0       yes
  1    0      0    1 1:1:1:0       yes
  2    1      0    2 2:2:2:0       yes
  3    1      0    3 3:3:3:0       yes
arighi@virtme-ng~/s/linux (master)> numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1
node 0 size: 2014 MB
node 0 free: 1931 MB
node 1 cpus: 2 3
node 1 size: 1896 MB
node 1 free: 1847 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

I think this is only possible in a virtualized environment, in this case
LLC should be disabled and NUMA enabled. Maybe it's worth checking also
for the case where LLC > NUMA...

-Andrea
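
The "inspect all the CPUs" idea above could take roughly the following
shape: walk every online CPU and report a mismatch as soon as one LLC span
differs from its NUMA node span. This is a sketch under the thread's
assumptions, not a posted patch; the helper name is hypothetical and the
caller is assumed to hold the RCU read lock.

/*
 * Sketch: return true if any online CPU has an LLC domain that does
 * not exactly cover its NUMA node. Checking only the first online CPU
 * can miss asymmetric layouts (e.g. one node with a single LLC next to
 * a node with several), so every online CPU is examined.
 */
static bool llc_numa_mismatch(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu));

		/* No LLC domain: treat as a mismatch, be conservative. */
		if (!sd)
			return true;

		if (!cpumask_equal(sched_domain_span(sd),
				   cpumask_of_node(cpu_to_node(cpu))))
			return true;
	}

	return false;
}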
Hello, Andrea.

On Wed, Nov 06, 2024 at 02:08:00AM +0100, Andrea Righi wrote:
...
> I think this is only possible in a virtualized environment, in this case
> LLC should be disabled and NUMA enabled. Maybe it's worth checking also
> for the case where LLC > NUMA...

I think the current code is okay. It just needs a short comment explaining
why it's so. A cleaner alternative may be walking the llc sd's and verify
whether any two llcs share the same node sd as an ancestor.

Thanks.

--
tejun
On Tue, Nov 05, 2024 at 03:15:59PM -1000, Tejun Heo wrote:
> Hello, Andrea.
>
> On Wed, Nov 06, 2024 at 02:08:00AM +0100, Andrea Righi wrote:
> ...
> > I think this is only possible in a virtualized environment, in this case
> > LLC should be disabled and NUMA enabled. Maybe it's worth checking also
> > for the case where LLC > NUMA...
>
> I think the current code is okay. It just needs a short comment explaining
> why it's so. A cleaner alternative may be walking the llc sd's and verify
> whether any two llcs share the same node sd as an ancestor.

Or maybe just compare the total number of LLC sd's and NUMA sd's.

Thanks.

--
tejun
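
The counting alternative could be sketched as below: count distinct LLC
spans across the online CPUs and compare the total against the number of
online NUMA nodes (substituted here for counting NUMA sd's). Since every
LLC domain is contained within a node, equal counts imply exactly one LLC
per node, i.e. the two levels describe the same partition. Hypothetical
helper, not a posted patch.

/*
 * Sketch: true if LLC and NUMA levels partition the online CPUs
 * identically. Uses a scratch cpumask to count each LLC span once.
 */
static bool llc_numa_same_partition(void)
{
	cpumask_var_t seen;
	unsigned int nr_llcs = 0;
	int cpu;

	if (!zalloc_cpumask_var(&seen, GFP_KERNEL))
		return false;

	rcu_read_lock();
	for_each_online_cpu(cpu) {
		struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu));

		/* Count each distinct LLC span only once. */
		if (sd && !cpumask_test_cpu(cpu, seen)) {
			cpumask_or(seen, seen, sched_domain_span(sd));
			nr_llcs++;
		}
	}
	rcu_read_unlock();

	free_cpumask_var(seen);

	return nr_llcs == num_online_nodes();
}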