When the LLC and NUMA domains fully overlap, enabling both optimizations
in the built-in idle CPU selection policy is redundant, as it leads to
searching for an idle CPU within the same domain twice.
Likewise, if all online CPUs are within a single LLC domain, LLC
optimization is unnecessary.
Therefore, detect overlapping domains and enable topology optimizations
only when necessary.
Fixes: 860a45219bce ("sched_ext: Introduce NUMA awareness to the default idle selection policy")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 23 ++++++++++++++++++-----
1 file changed, 18 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index fc7f15eefe54..82acbaffd5a7 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3140,7 +3140,7 @@ static void update_selcpu_topology(void)
 {
 	bool enable_llc = false, enable_numa = false;
 	struct sched_domain *sd;
-	const struct cpumask *cpus;
+	const struct cpumask *llc_cpus = NULL, *numa_cpus = NULL;
 	s32 cpu = cpumask_first(cpu_online_mask);
 
 	/*
@@ -3154,16 +3154,29 @@ static void update_selcpu_topology(void)
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_llc, cpu));
 	if (sd) {
-		cpus = sched_domain_span(sd);
-		if (cpumask_weight(cpus) < num_possible_cpus())
+		llc_cpus = sched_domain_span(sd);
+		if (cpumask_weight(llc_cpus) < num_possible_cpus())
 			enable_llc = true;
 	}
 	sd = highest_flag_domain(cpu, SD_NUMA);
 	if (sd) {
-		cpus = sched_group_span(sd->groups);
-		if (cpumask_weight(cpus) < num_possible_cpus())
+		numa_cpus = sched_group_span(sd->groups);
+		if (cpumask_weight(numa_cpus) < num_possible_cpus())
 			enable_numa = true;
 	}
+	/*
+	 * If the NUMA domain perfectly overlaps with the LLC domain, enable
+	 * LLC optimization only, as checking for an idle CPU in the same
+	 * domain twice is redundant.
+	 */
+	if (enable_numa && enable_llc && cpumask_equal(numa_cpus, llc_cpus))
+		enable_numa = false;
+	/*
+	 * If all the online CPUs are in the same LLC domain, there is no
+	 * reason to enable LLC optimization.
+	 */
+	if (enable_llc && cpumask_equal(llc_cpus, cpu_online_mask))
+		enable_llc = false;
 	rcu_read_unlock();
 
 	pr_debug("sched_ext: LLC idle selection %s\n",
--
2.47.0
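
To see why a full overlap makes the second scan redundant, consider how the
two flags are consumed. The following is a condensed sketch of the built-in
idle CPU selection cascade, not the literal kernel code: scx_pick_idle_cpu()
is the real helper in kernel/sched/ext.c, but the surrounding function shape
and names here are simplified for illustration.

/*
 * Illustrative sketch only: search for an idle CPU in progressively
 * wider masks. If llc_cpus and numa_cpus are equal, the second
 * scx_pick_idle_cpu() call scans exactly the same CPUs as the first,
 * which is the redundancy the patch above avoids.
 */
static s32 pick_idle_cpu_sketch(struct task_struct *p,
				const struct cpumask *llc_cpus,
				const struct cpumask *numa_cpus)
{
	s32 cpu;

	/* First, any idle CPU sharing the LLC, if LLC optimization is on. */
	if (llc_cpus) {
		cpu = scx_pick_idle_cpu(llc_cpus, 0);
		if (cpu >= 0)
			return cpu;
	}

	/* Then, any idle CPU in the same NUMA node, if enabled. */
	if (numa_cpus) {
		cpu = scx_pick_idle_cpu(numa_cpus, 0);
		if (cpu >= 0)
			return cpu;
	}

	/* Finally, anywhere the task is allowed to run. */
	return scx_pick_idle_cpu(p->cpus_ptr, 0);
}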
Hello, Andrea.

Sorry about the delay.

On Thu, Oct 31, 2024 at 08:34:01AM +0100, Andrea Righi wrote:
> @@ -3154,16 +3154,29 @@ static void update_selcpu_topology(void)
> 	rcu_read_lock();
> 	sd = rcu_dereference(per_cpu(sd_llc, cpu));
> 	if (sd) {
> -		cpus = sched_domain_span(sd);
> -		if (cpumask_weight(cpus) < num_possible_cpus())
> +		llc_cpus = sched_domain_span(sd);
> +		if (cpumask_weight(llc_cpus) < num_possible_cpus())

Not from this patch but should the weight be compared against
num_online_cpus()? Sched domains don't include offline CPUs, right?

...

> +	/*
> +	 * If the NUMA domain perfectly overlaps with the LLC domain, enable
> +	 * LLC optimization only, as checking for an idle CPU in the same
> +	 * domain twice is redundant.
> +	 */
> +	if (enable_numa && enable_llc && cpumask_equal(numa_cpus, llc_cpus))
> +		enable_numa = false;
> +	/*
> +	 * If all the online CPUs are in the same LLC domain, there is no
> +	 * reason to enable LLC optimization.
> +	 */
> +	if (enable_llc && cpumask_equal(llc_cpus, cpu_online_mask))
> +		enable_llc = false;

The second test looks like it should always be correct. I'm not sure the
first one is. It probably is good enough but would at least be worthwhile to
explain why that is?

Thanks.

--
tejun
On Tue, Nov 05, 2024 at 01:32:05PM -1000, Tejun Heo wrote:
> Hello, Andrea.
>
> Sorry about the delay.
>
> On Thu, Oct 31, 2024 at 08:34:01AM +0100, Andrea Righi wrote:
> > @@ -3154,16 +3154,29 @@ static void update_selcpu_topology(void)
> > 	rcu_read_lock();
> > 	sd = rcu_dereference(per_cpu(sd_llc, cpu));
> > 	if (sd) {
> > -		cpus = sched_domain_span(sd);
> > -		if (cpumask_weight(cpus) < num_possible_cpus())
> > +		llc_cpus = sched_domain_span(sd);
> > +		if (cpumask_weight(llc_cpus) < num_possible_cpus())
>
> Not from this patch but should the weight be compared against
> num_online_cpus()? Sched domains don't include offline CPUs, right?

That's right, the sched domain is definitely updated when a CPU goes
offline (I've just verified), so we should check for num_online_cpus()
here.

Thanks for noticing it!

> ...
> > +	/*
> > +	 * If the NUMA domain perfectly overlaps with the LLC domain, enable
> > +	 * LLC optimization only, as checking for an idle CPU in the same
> > +	 * domain twice is redundant.
> > +	 */
> > +	if (enable_numa && enable_llc && cpumask_equal(numa_cpus, llc_cpus))
> > +		enable_numa = false;
> > +	/*
> > +	 * If all the online CPUs are in the same LLC domain, there is no
> > +	 * reason to enable LLC optimization.
> > +	 */
> > +	if (enable_llc && cpumask_equal(llc_cpus, cpu_online_mask))
> > +		enable_llc = false;
>
> The second test looks like it should always be correct. I'm not sure the
> first one is. It probably is good enough but would at least be worthwhile to
> explain why that is?

Let's say we have 2 NUMA nodes, each with 2 sockets, and each socket
has its own L3 cache. In this case, numa_cpus will be larger than
llc_cpus, and enabling both NUMA and LLC optimizations would be
beneficial.

On the other hand, if each NUMA node contains only 1 socket, numa_cpus
and llc_cpus will overlap completely, making it unnecessary to enable
both NUMA and LLC optimizations, so we can have just the LLC in this
case.

Would something like this help clarifying the first test?

Thanks,
-Andrea
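
For concreteness, the adjustment agreed on above would look roughly like
this fragment of update_selcpu_topology() (a sketch using the patch's own
variables; the actual follow-up revision may differ):

	sd = rcu_dereference(per_cpu(sd_llc, cpu));
	if (sd) {
		llc_cpus = sched_domain_span(sd);
		/*
		 * Sched domains are rebuilt on hotplug and never span
		 * offline CPUs, so the weight must be compared against
		 * the online count, not the possible count.
		 */
		if (cpumask_weight(llc_cpus) < num_online_cpus())
			enable_llc = true;
	}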
Hello,

On Wed, Nov 06, 2024 at 01:29:08AM +0100, Andrea Righi wrote:
...
> Let's say we have 2 NUMA nodes, each with 2 sockets, and each socket
> has its own L3 cache. In this case, numa_cpus will be larger than
> llc_cpus, and enabling both NUMA and LLC optimizations would be
> beneficial.
>
> On the other hand, if each NUMA node contains only 1 socket, numa_cpus
> and llc_cpus will overlap completely, making it unnecessary to enable
> both NUMA and LLC optimizations, so we can have just the LLC in this
> case.
>
> Would something like this help clarifying the first test?

I was more thinking about the theoretical case where one socket has one LLC
while a different socket has multiple LLCs. I don't think there are any
systems which are actually like that but there's nothing in the code which
prevents that (unlike a single CPU belonging to multiple domains), so it'd
probably be worthwhile to explain why the abbreviated test is enough.

Thanks.

--
tejun
On Tue, Nov 05, 2024 at 02:33:53PM -1000, Tejun Heo wrote:
> Hello,
>
> On Wed, Nov 06, 2024 at 01:29:08AM +0100, Andrea Righi wrote:
> ...
> > Let's say we have 2 NUMA nodes, each with 2 sockets, and each socket
> > has its own L3 cache. In this case, numa_cpus will be larger than
> > llc_cpus, and enabling both NUMA and LLC optimizations would be
> > beneficial.
> >
> > On the other hand, if each NUMA node contains only 1 socket, numa_cpus
> > and llc_cpus will overlap completely, making it unnecessary to enable
> > both NUMA and LLC optimizations, so we can have just the LLC in this
> > case.
> >
> > Would something like this help clarifying the first test?
>
> I was more thinking about the theoretical case where one socket has one LLC
> while a different socket has multiple LLCs. I don't think there are any
> systems which are actually like that but there's nothing in the code which
> prevents that (unlike a single CPU belonging to multiple domains), so it'd
> probably be worthwhile to explain why the abbreviated test is enough.

In theory a CPU can only belong to a single domain (otherwise other
stuff in topology.c are broken as well), but potentially we could have
something like:

 NUMA 1
   - CPU 1 (L3)
 NUMA 2
   - CPU 2 (L3)
   - CPU 3 (L3)

If we inspect CPU 1 only we may incorrectly assume that numa_cpus ==
llc_cpus. To handle this properly we may have to inspect all the CPUs,
instead of just the first one.

Moreover, with qemu we can also simulate ugly topologies like 2 NUMA
nodes and 1 L3 cache that covers the 2 NUMA nodes:

arighi@gpd3~/s/linux (master)> vng --cpu 4 -m 4G --numa 2G,cpus=0-1 --numa 2G,cpus=2-3
...
arighi@virtme-ng~/s/linux (master)> lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
  0    0      0    0 0:0:0:0       yes
  1    0      0    1 1:1:1:0       yes
  2    1      0    2 2:2:2:0       yes
  3    1      0    3 3:3:3:0       yes
arighi@virtme-ng~/s/linux (master)> numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1
node 0 size: 2014 MB
node 0 free: 1931 MB
node 1 cpus: 2 3
node 1 size: 1896 MB
node 1 free: 1847 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

I think this is only possible in a virtualized environment, in this case
LLC should be disabled and NUMA enabled. Maybe it's worth checking also
for the case where LLC > NUMA...

-Andrea
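
The "inspect all the CPUs" idea above could take roughly the following
shape: walk every online CPU and report a mismatch as soon as one LLC span
differs from its NUMA node span. This is a sketch under the thread's
assumptions, not a posted patch; the helper name is hypothetical and the
caller is assumed to hold the RCU read lock.

/*
 * Sketch: return true if any online CPU has an LLC domain that does
 * not exactly cover its NUMA node. Checking only the first online CPU
 * can miss asymmetric layouts (e.g. one node with a single LLC next to
 * a node with several), so every online CPU is examined.
 */
static bool llc_numa_mismatch(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu));

		/* No LLC domain: treat as a mismatch, be conservative. */
		if (!sd)
			return true;

		if (!cpumask_equal(sched_domain_span(sd),
				   cpumask_of_node(cpu_to_node(cpu))))
			return true;
	}

	return false;
}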
Hello, Andrea.

On Wed, Nov 06, 2024 at 02:08:00AM +0100, Andrea Righi wrote:
...
> I think this is only possible in a virtualized environment, in this case
> LLC should be disabled and NUMA enabled. Maybe it's worth checking also
> for the case where LLC > NUMA...

I think the current code is okay. It just needs a short comment explaining
why it's so. A cleaner alternative may be walking the llc sd's and verify
whether any two llcs share the same node sd as an ancestor.

Thanks.

--
tejun
On Tue, Nov 05, 2024 at 03:15:59PM -1000, Tejun Heo wrote:
> Hello, Andrea.
>
> On Wed, Nov 06, 2024 at 02:08:00AM +0100, Andrea Righi wrote:
> ...
> > I think this is only possible in a virtualized environment, in this case
> > LLC should be disabled and NUMA enabled. Maybe it's worth checking also
> > for the case where LLC > NUMA...
>
> I think the current code is okay. It just needs a short comment explaining
> why it's so. A cleaner alternative may be walking the llc sd's and verify
> whether any two llcs share the same node sd as an ancestor.

Or maybe just compare the total number of LLC sd's and NUMA sd's.

Thanks.

--
tejun
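
The counting alternative could be sketched as below: count distinct LLC
spans across the online CPUs and compare the total against the number of
online NUMA nodes (substituted here for counting NUMA sd's). Since every
LLC domain is contained within a node, equal counts imply exactly one LLC
per node, i.e. the two levels describe the same partition. Hypothetical
helper, not a posted patch.

/*
 * Sketch: true if LLC and NUMA levels partition the online CPUs
 * identically. Uses a scratch cpumask to count each LLC span once.
 */
static bool llc_numa_same_partition(void)
{
	cpumask_var_t seen;
	unsigned int nr_llcs = 0;
	int cpu;

	if (!zalloc_cpumask_var(&seen, GFP_KERNEL))
		return false;

	rcu_read_lock();
	for_each_online_cpu(cpu) {
		struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu));

		/* Count each distinct LLC span only once. */
		if (sd && !cpumask_test_cpu(cpu, seen)) {
			cpumask_or(seen, seen, sched_domain_span(sd));
			nr_llcs++;
		}
	}
	rcu_read_unlock();

	free_cpumask_var(seen);

	return nr_llcs == num_online_nodes();
}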