[v3] sched_ext: Do not enable LLC/NUMA optimizations when domains overlap

[PATCH v3 sched_ext/for-6.13] sched_ext: Do not enable LLC/NUMA optimizations when domains overlap

Posted by Andrea Righi 1 year, 3 months ago

When the LLC and NUMA domains fully overlap, enabling both optimizations
in the built-in idle CPU selection policy is redundant, as it leads to
searching for an idle CPU within the same domain twice.

Likewise, if all online CPUs are within a single LLC domain, LLC
optimization is unnecessary.

Therefore, detect overlapping domains and enable topology optimizations
only when necessary.

Moreover, rely on the online CPUs for this detection logic, instead of
using the possible CPUs.

Fixes: 860a45219bce ("sched_ext: Introduce NUMA awareness to the default idle selection policy")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext.c | 85 +++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 72 insertions(+), 13 deletions(-)

ChangeLog v2 -> v3:
  - consistently use cpumask_of_node(cpu_to_node(cpu)) to determine the
    NUMA domain, instead of relying on the NUMA sched_domain, which
    seems to incorrectly include all CPUs in certain NUMA configurations
    (I'll investigate more)
  - use sd->span_weight instead of re-evaluating it via cpumask_weight()
  - clarify comment about CPU -> LLC -> NUMA hierarchy

ChangeLog v1 -> v2:
  - rely on the online CPUs, instead of the possible CPUs
  - handle asymmetric NUMA configurations (which may arise from CPU
    hotplugging or virtualization)
  - add more comments to clarify the possible LLC/NUMA scenarios

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index fc7f15eefe54..f154aaeb69e4 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3129,12 +3129,63 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
 		goto retry;
 }
 
+/*
+ * Return true if the LLC domains do not perfectly overlap with the NUMA
+ * domains, false otherwise.
+ */
+static bool llc_numa_mismatch(void)
+{
+	int cpu;
+
+	/*
+	 * We need to scan all online CPUs to verify whether their scheduling
+	 * domains overlap.
+	 *
+	 * While it is rare to encounter architectures with asymmetric NUMA
+	 * topologies, CPU hotplugging or virtualized environments can result
+	 * in asymmetric configurations.
+	 *
+	 * For example:
+	 *
+	 *  NUMA 0:
+	 *    - LLC 0: cpu0..cpu7
+	 *    - LLC 1: cpu8..cpu15 [offline]
+	 *
+	 *  NUMA 1:
+	 *    - LLC 0: cpu16..cpu23
+	 *    - LLC 1: cpu24..cpu31
+	 *
+	 * In this case, if we only check the first online CPU (cpu0), we might
+	 * incorrectly assume that the LLC and NUMA domains are fully
+	 * overlapping, which is incorrect (as NUMA 1 has two distinct LLC
+	 * domains).
+	 */
+	for_each_online_cpu(cpu) {
+		const struct cpumask *numa_cpus;
+		struct sched_domain *sd;
+
+		sd = rcu_dereference(per_cpu(sd_llc, cpu));
+		if (!sd)
+			return true;
+
+		numa_cpus = cpumask_of_node(cpu_to_node(cpu));
+		if (sd->span_weight != cpumask_weight(numa_cpus))
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * Initialize topology-aware scheduling.
  *
  * Detect if the system has multiple LLC or multiple NUMA domains and enable
  * cache-aware / NUMA-aware scheduling optimizations in the default CPU idle
  * selection policy.
+ *
+ * Assumption: the kernel's internal topology representation assumes that each
+ * CPU belongs to a single LLC domain, and that each LLC domain is entirely
+ * contained within a single NUMA node.
  */
 static void update_selcpu_topology(void)
 {
@@ -3144,26 +3195,34 @@ static void update_selcpu_topology(void)
 	s32 cpu = cpumask_first(cpu_online_mask);
 
 	/*
-	 * We only need to check the NUMA node and LLC domain of the first
-	 * available CPU to determine if they cover all CPUs.
+	 * Enable LLC domain optimization only when there are multiple LLC
+	 * domains among the online CPUs. If all online CPUs are part of a
+	 * single LLC domain, the idle CPU selection logic can choose any
+	 * online CPU without bias.
 	 *
-	 * If all CPUs belong to the same NUMA node or share the same LLC
-	 * domain, enabling NUMA or LLC optimizations is unnecessary.
-	 * Otherwise, these optimizations can be enabled.
+	 * Note that it is sufficient to check the LLC domain of the first
+	 * online CPU to determine whether a single LLC domain includes all
+	 * CPUs.
 	 */
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_llc, cpu));
 	if (sd) {
-		cpus = sched_domain_span(sd);
-		if (cpumask_weight(cpus) < num_possible_cpus())
+		if (sd->span_weight < num_online_cpus())
 			enable_llc = true;
 	}
-	sd = highest_flag_domain(cpu, SD_NUMA);
-	if (sd) {
-		cpus = sched_group_span(sd->groups);
-		if (cpumask_weight(cpus) < num_possible_cpus())
-			enable_numa = true;
-	}
+
+	/*
+	 * Enable NUMA optimization only when there are multiple NUMA domains
+	 * among the online CPUs and the NUMA domains don't perfectly overlaps
+	 * with the LLC domains.
+	 *
+	 * If all CPUs belong to the same NUMA node and the same LLC domain,
+	 * enabling both NUMA and LLC optimizations is unnecessary, as checking
+	 * for an idle CPU in the same domain twice is redundant.
+	 */
+	cpus = cpumask_of_node(cpu_to_node(cpu));
+	if ((cpumask_weight(cpus) < num_online_cpus()) & llc_numa_mismatch())
+		enable_numa = true;
 	rcu_read_unlock();
 
 	pr_debug("sched_ext: LLC idle selection %s\n",
-- 
2.47.0

Re: [PATCH v3 sched_ext/for-6.13] sched_ext: Do not enable LLC/NUMA optimizations when domains overlap

Posted by Nathan Chancellor 1 year, 3 months ago

Hi Andrea,

On Fri, Nov 08, 2024 at 01:01:36AM +0100, Andrea Righi wrote:
...
> +	/*
> +	 * Enable NUMA optimization only when there are multiple NUMA domains
> +	 * among the online CPUs and the NUMA domains don't perfectly overlaps
> +	 * with the LLC domains.
> +	 *
> +	 * If all CPUs belong to the same NUMA node and the same LLC domain,
> +	 * enabling both NUMA and LLC optimizations is unnecessary, as checking
> +	 * for an idle CPU in the same domain twice is redundant.
> +	 */
> +	cpus = cpumask_of_node(cpu_to_node(cpu));
> +	if ((cpumask_weight(cpus) < num_online_cpus()) & llc_numa_mismatch())
> +		enable_numa = true;

With this hunk in next-20241108, I am seeing a clang warning (or error
since CONFIG_WERROR=y):

  In file included from kernel/sched/build_policy.c:63:
  kernel/sched/ext.c:3252:6: error: use of bitwise '&' with boolean operands [-Werror,-Wbitwise-instead-of-logical]
   3252 |         if ((cpumask_weight(cpus) < num_online_cpus()) & llc_numa_mismatch())
        |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        |                                                        &&
  kernel/sched/ext.c:3252:6: note: cast one or both operands to int to silence this warning
  1 error generated.

Was use of a bitwise AND here intentional (i.e., should
llc_numa_mismatch() always be called regardless of the outcome of the
first condition) or can it be switched to a logical AND to silence the
warning? I do not mind sending a patch but I did not want to be wrong
off the bat. If there is some other better solution that I am not seeing,
please feel free to send a patch with this as just a report.

Cheers,
Nathan

Re: [PATCH v3 sched_ext/for-6.13] sched_ext: Do not enable LLC/NUMA optimizations when domains overlap

Posted by Tejun Heo 1 year, 3 months ago

On Fri, Nov 08, 2024 at 11:17:53AM -0700, Nathan Chancellor wrote:
> Hi Andrea,
> 
> On Fri, Nov 08, 2024 at 01:01:36AM +0100, Andrea Righi wrote:
> ...
> > +	/*
> > +	 * Enable NUMA optimization only when there are multiple NUMA domains
> > +	 * among the online CPUs and the NUMA domains don't perfectly overlaps
> > +	 * with the LLC domains.
> > +	 *
> > +	 * If all CPUs belong to the same NUMA node and the same LLC domain,
> > +	 * enabling both NUMA and LLC optimizations is unnecessary, as checking
> > +	 * for an idle CPU in the same domain twice is redundant.
> > +	 */
> > +	cpus = cpumask_of_node(cpu_to_node(cpu));
> > +	if ((cpumask_weight(cpus) < num_online_cpus()) & llc_numa_mismatch())
> > +		enable_numa = true;
> 
> With this hunk in next-20241108, I am seeing a clang warning (or error
> since CONFIG_WERROR=y):
> 
>   In file included from kernel/sched/build_policy.c:63:
>   kernel/sched/ext.c:3252:6: error: use of bitwise '&' with boolean operands [-Werror,-Wbitwise-instead-of-logical]
>    3252 |         if ((cpumask_weight(cpus) < num_online_cpus()) & llc_numa_mismatch())
>         |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>         |                                                        &&
>   kernel/sched/ext.c:3252:6: note: cast one or both operands to int to silence this warning
>   1 error generated.
> 
> Was use of a bitwise AND here intentional (i.e., should
> llc_numa_mismatch() always be called regardless of the outcome of the
> first condition) or can it be switched to a logical AND to silence the
> warning? I do not mind sending a patch but I did not want to be wrong
> off the bat. If there is some other better solution that I am not seeing,
> please feel free to send a patch with this as just a report.

Oops, that looks like a mistake. I don't see why it can't be &&.

Thanks.

-- 
tejun

Re: [PATCH v3 sched_ext/for-6.13] sched_ext: Do not enable LLC/NUMA optimizations when domains overlap

Posted by Andrea Righi 1 year, 3 months ago

On Fri, Nov 08, 2024 at 08:54:33AM -1000, Tejun Heo wrote:
> On Fri, Nov 08, 2024 at 11:17:53AM -0700, Nathan Chancellor wrote:
> > Hi Andrea,
> >
> > On Fri, Nov 08, 2024 at 01:01:36AM +0100, Andrea Righi wrote:
> > ...
> > > +   /*
> > > +    * Enable NUMA optimization only when there are multiple NUMA domains
> > > +    * among the online CPUs and the NUMA domains don't perfectly overlaps
> > > +    * with the LLC domains.
> > > +    *
> > > +    * If all CPUs belong to the same NUMA node and the same LLC domain,
> > > +    * enabling both NUMA and LLC optimizations is unnecessary, as checking
> > > +    * for an idle CPU in the same domain twice is redundant.
> > > +    */
> > > +   cpus = cpumask_of_node(cpu_to_node(cpu));
> > > +   if ((cpumask_weight(cpus) < num_online_cpus()) & llc_numa_mismatch())
> > > +           enable_numa = true;
> >
> > With this hunk in next-20241108, I am seeing a clang warning (or error
> > since CONFIG_WERROR=y):
> >
> >   In file included from kernel/sched/build_policy.c:63:
> >   kernel/sched/ext.c:3252:6: error: use of bitwise '&' with boolean operands [-Werror,-Wbitwise-instead-of-logical]
> >    3252 |         if ((cpumask_weight(cpus) < num_online_cpus()) & llc_numa_mismatch())
> >         |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >         |                                                        &&
> >   kernel/sched/ext.c:3252:6: note: cast one or both operands to int to silence this warning
> >   1 error generated.
> >
> > Was use of a bitwise AND here intentional (i.e., should
> > llc_numa_mismatch() always be called regardless of the outcome of the
> > first condition) or can it be switched to a logical AND to silence the
> > warning? I do not mind sending a patch but I did not want to be wrong
> > off the bat. If there is some other better solution that I am not seeing,
> > please feel free to send a patch with this as just a report.
> 
> Oops, that looks like a mistake. I don't see why it can't be &&.

Sorry, this is a mistake, it definitely needs to be &&.

Do you want me to send a fix on top of this one or a v4?

Thanks,
-Andrea

Re: [PATCH v3 sched_ext/for-6.13] sched_ext: Do not enable LLC/NUMA optimizations when domains overlap

Posted by Tejun Heo 1 year, 3 months ago

On Fri, Nov 08, 2024 at 08:39:15PM +0100, Andrea Righi wrote:
> Sorry, this is a mistake, it definitely needs to be &&.
> 
> Do you want me to send a fix on top of this one or a v4?

An incremental fix patch would be great.

Thanks.

-- 
tejun

Re: [PATCH v3 sched_ext/for-6.13] sched_ext: Do not enable LLC/NUMA optimizations when domains overlap

Posted by Andrea Righi 1 year, 3 months ago

On Fri, Nov 08, 2024 at 09:42:44AM -1000, Tejun Heo wrote:
> On Fri, Nov 08, 2024 at 08:39:15PM +0100, Andrea Righi wrote:
> > Sorry, this is a mistake, it definitely needs to be &&.
> >
> > Do you want me to send a fix on top of this one or a v4?
> 
> An incremental fix patch would be great.
> 
> Thanks.

Sent. Thanks Nathan for noticing it!

-Andrea

Re: [PATCH v3 sched_ext/for-6.13] sched_ext: Do not enable LLC/NUMA optimizations when domains overlap

Posted by Tejun Heo 1 year, 3 months ago

On Fri, Nov 08, 2024 at 01:01:36AM +0100, Andrea Righi wrote:
> When the LLC and NUMA domains fully overlap, enabling both optimizations
> in the built-in idle CPU selection policy is redundant, as it leads to
> searching for an idle CPU within the same domain twice.
> 
> Likewise, if all online CPUs are within a single LLC domain, LLC
> optimization is unnecessary.
> 
> Therefore, detect overlapping domains and enable topology optimizations
> only when necessary.
> 
> Moreover, rely on the online CPUs for this detection logic, instead of
> using the possible CPUs.
> 
> Fixes: 860a45219bce ("sched_ext: Introduce NUMA awareness to the default idle selection policy")
> Signed-off-by: Andrea Righi <arighi@nvidia.com>

Applied to sched_ext/for-6.13.

Thanks.

-- 
tejun