kernel/sched/ext.c | 118 ++++++++++++++++++++++++++++++++++++++------- 1 file changed, 101 insertions(+), 17 deletions(-)
Similarly to commit dfa4ed29b18c ("sched_ext: Introduce LLC awareness to
the default idle selection policy"), extend the built-in idle CPU
selection policy to also prioritize CPUs within the same NUMA node.
With this change applied, the built-in CPU idle selection policy follows
this logic:
- always prioritize CPUs from fully idle SMT cores,
- select the same CPU if possible,
- select a CPU within the same LLC domain,
- select a CPU within the same NUMA node.
Both NUMA and LLC awareness features are enabled only when the system
has multiple NUMA nodes or multiple LLC domains.
In the future, we may want to improve the NUMA node selection to account
for the node distance from prev_cpu. Currently, the logic only tries to
keep tasks running on the same NUMA node. If all CPUs within a node are
busy, the next NUMA node is chosen randomly.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 118 ++++++++++++++++++++++++++++++++++++++-------
1 file changed, 101 insertions(+), 17 deletions(-)
ChangeLog v1 -> v2:
- autodetect at boot whether NUMA and LLC capabilities should be used
and use static_keys to control their activation
- rely on cpumask_of_node/cpu_to_node() to determine the NUMA domain
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index d7ae816db6f2..af2ffafda296 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -869,6 +869,8 @@ static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_last);
static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_exiting);
static DEFINE_STATIC_KEY_FALSE(scx_ops_cpu_preempt);
static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
+static DEFINE_STATIC_KEY_FALSE(scx_topology_llc);
+static DEFINE_STATIC_KEY_FALSE(scx_topology_numa);
static struct static_key_false scx_has_op[SCX_OPI_END] =
{ [0 ... SCX_OPI_END-1] = STATIC_KEY_FALSE_INIT };
@@ -3124,31 +3126,68 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
goto retry;
}
-#ifdef CONFIG_SCHED_MC
/*
- * Return the cpumask of CPUs usable by task @p in the same LLC domain of @cpu,
- * or NULL if the LLC domain cannot be determined.
+ * Initialize topology-aware scheduling.
*/
-static const struct cpumask *llc_domain(const struct task_struct *p, s32 cpu)
+static void init_topology(void)
{
- struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu));
- const struct cpumask *llc_cpus = sd ? sched_domain_span(sd) : NULL;
+ const struct cpumask *cpus;
+ int nid;
+ s32 cpu;
+
+ /*
+ * Detect if the system has multiple NUMA nodes distributed across the
+ * available CPUs and, in that case, enable NUMA-aware scheduling in
+ * the default CPU idle selection policy.
+ */
+ for_each_node(nid) {
+ cpus = cpumask_of_node(nid);
+ if (cpumask_weight(cpus) < nr_cpu_ids) {
+ static_branch_enable(&scx_topology_numa);
+ pr_devel("sched_ext: NUMA scheduling enabled");
+ break;
+ }
+ }
/*
- * Return the LLC domain only if the task is allowed to run on all
- * CPUs.
+ * Detect if the system has multiple LLC domains and enable cache-aware
+ * scheduling in the default CPU idle selection policy.
*/
- return p->nr_cpus_allowed == nr_cpu_ids ? llc_cpus : NULL;
-}
-#else /* CONFIG_SCHED_MC */
-static inline const struct cpumask *llc_domain(struct task_struct *p, s32 cpu)
-{
- return NULL;
+ for_each_possible_cpu(cpu) {
+ struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu));
+
+ if (!sd)
+ continue;
+ cpus = sched_domain_span(sd);
+ if (cpumask_weight(cpus) < nr_cpu_ids) {
+ static_branch_enable(&scx_topology_llc);
+ pr_devel("sched_ext: LLC scheduling enabled");
+ break;
+ }
+ }
}
-#endif /* CONFIG_SCHED_MC */
/*
- * Built-in cpu idle selection policy.
+ * Built-in CPU idle selection policy:
+ *
+ * 1. Prioritize full-idle cores:
+ * - always prioritize CPUs from fully idle cores (both logical CPUs are
+ * idle) to avoid interference caused by SMT.
+ *
+ * 2. Reuse the same CPU:
+ * - prefer the last used CPU to take advantage of cached data (L1, L2) and
+ * branch prediction optimizations.
+ *
+ * 3. Pick a CPU within the same LLC (Last-Level Cache):
+ * - if the above conditions aren't met, pick a CPU that shares the same LLC
+ * to maintain cache locality.
+ *
+ * 4. Pick a CPU within the same NUMA node, if enabled:
+ * - choose a CPU from the same NUMA node to reduce memory access latency.
+ *
+ * Steps 3 and 4 are performed only if the system has, respectively, multiple
+ * LLC domains / multiple NUMA nodes (see scx_topology_llc and
+ * scx_topology_numa).
*
* NOTE: tasks that can only run on 1 CPU are excluded by this logic, because
* we never call ops.select_cpu() for them, see select_task_rq().
@@ -3156,7 +3195,8 @@ static inline const struct cpumask *llc_domain(struct task_struct *p, s32 cpu)
static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
u64 wake_flags, bool *found)
{
- const struct cpumask *llc_cpus = llc_domain(p, prev_cpu);
+ const struct cpumask *llc_cpus = NULL;
+ const struct cpumask *numa_cpus = NULL;
s32 cpu;
*found = false;
@@ -3166,6 +3206,30 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
return prev_cpu;
}
+ /*
+ * Determine the scheduling domain only if the task is allowed to run
+ * on all CPUs.
+ *
+ * This is done primarily for efficiency, as it avoids the overhead of
+ * updating a cpumask every time we need to select an idle CPU (which
+ * can be costly in large SMP systems), but it also aligns logically:
+ * if a task's scheduling domain is restricted by user-space (through
+ * CPU affinity), the task will simply use the flat scheduling domain
+ * defined by user-space.
+ */
+ if (p->nr_cpus_allowed == nr_cpu_ids) {
+ if (static_branch_unlikely(&scx_topology_numa))
+ numa_cpus = cpumask_of_node(cpu_to_node(prev_cpu));
+
+ if (static_branch_unlikely(&scx_topology_llc)) {
+ struct sched_domain *sd;
+
+ sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
+ if (sd)
+ llc_cpus = sched_domain_span(sd);
+ }
+ }
+
/*
* If WAKE_SYNC, try to migrate the wakee to the waker's CPU.
*/
@@ -3226,6 +3290,15 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
goto cpu_found;
}
+ /*
+ * Search for any fully idle core in the same NUMA node.
+ */
+ if (numa_cpus) {
+ cpu = scx_pick_idle_cpu(numa_cpus, SCX_PICK_IDLE_CORE);
+ if (cpu >= 0)
+ goto cpu_found;
+ }
+
/*
* Search for any full idle core usable by the task.
*/
@@ -3251,6 +3324,15 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
goto cpu_found;
}
+ /*
+ * Search for any idle CPU in the same NUMA node.
+ */
+ if (numa_cpus) {
+ cpu = scx_pick_idle_cpu(numa_cpus, 0);
+ if (cpu >= 0)
+ goto cpu_found;
+ }
+
/*
* Search for any idle CPU usable by the task.
*/
@@ -7315,6 +7397,8 @@ static int __init scx_init(void)
return ret;
}
+ init_topology();
+
return 0;
}
__initcall(scx_init);
--
2.47.0
Hello, On Fri, Oct 25, 2024 at 06:25:35PM +0200, Andrea Righi wrote: ... > +static DEFINE_STATIC_KEY_FALSE(scx_topology_llc); > +static DEFINE_STATIC_KEY_FALSE(scx_topology_numa); Maybe name them sth like scx_selcpu_topo_llc given that this is only used by selcpu? > +static void init_topology(void) Ditto with naming. > { > - struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu)); > - const struct cpumask *llc_cpus = sd ? sched_domain_span(sd) : NULL; > + const struct cpumask *cpus; > + int nid; > + s32 cpu; > + > + /* > + * Detect if the system has multiple NUMA nodes distributed across the > + * available CPUs and, in that case, enable NUMA-aware scheduling in > + * the default CPU idle selection policy. > + */ > + for_each_node(nid) { > + cpus = cpumask_of_node(nid); > + if (cpumask_weight(cpus) < nr_cpu_ids) { Comparing number of cpus with nr_cpu_ids doesn't work. The above condition can trigger on single node machines with some CPUs offlines or unavailable for example. I think num_node_state(N_CPU) should work or if you want to keep with sched_domains, maybe highest_flag_domain(some_cpu, SD_NUMA)->groups->weight would work? ... > + for_each_possible_cpu(cpu) { > + struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu)); > + > + if (!sd) > + continue; > + cpus = sched_domain_span(sd); > + if (cpumask_weight(cpus) < nr_cpu_ids) { Ditto. ... > + /* > + * Determine the scheduling domain only if the task is allowed to run > + * on all CPUs. > + * > + * This is done primarily for efficiency, as it avoids the overhead of > + * updating a cpumask every time we need to select an idle CPU (which > + * can be costly in large SMP systems), but it also aligns logically: > + * if a task's scheduling domain is restricted by user-space (through > + * CPU affinity), the task will simply use the flat scheduling domain > + * defined by user-space. > + */ > + if (p->nr_cpus_allowed == nr_cpu_ids) { Should compare against nr_possible_cpus. Thanks. -- tejun
On Fri, Oct 25, 2024 at 10:02:31AM -1000, Tejun Heo wrote: > External email: Use caution opening links or attachments > > > Hello, > > On Fri, Oct 25, 2024 at 06:25:35PM +0200, Andrea Righi wrote: > ... > > +static DEFINE_STATIC_KEY_FALSE(scx_topology_llc); > > +static DEFINE_STATIC_KEY_FALSE(scx_topology_numa); > > Maybe name them sth like scx_selcpu_topo_llc given that this is only used by > selcpu? Ok. > > > +static void init_topology(void) > > Ditto with naming. Ok. > > > { > > - struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu)); > > - const struct cpumask *llc_cpus = sd ? sched_domain_span(sd) : NULL; > > + const struct cpumask *cpus; > > + int nid; > > + s32 cpu; > > + > > + /* > > + * Detect if the system has multiple NUMA nodes distributed across the > > + * available CPUs and, in that case, enable NUMA-aware scheduling in > > + * the default CPU idle selection policy. > > + */ > > + for_each_node(nid) { > > + cpus = cpumask_of_node(nid); > > + if (cpumask_weight(cpus) < nr_cpu_ids) { > > Comparing number of cpus with nr_cpu_ids doesn't work. The above condition > can trigger on single node machines with some CPUs offlines or unavailable > for example. I think num_node_state(N_CPU) should work or if you want to > keep with sched_domains, maybe highest_flag_domain(some_cpu, > SD_NUMA)->groups->weight would work? Ok, checking num_possible_cpus() instead of nr_cpu_ids makes more sense. I was also thinking to refresh the static keys on hotplug events and check for num_possible_cpus(), in this way the topology optimizations should be always (more) consistent, even when some of the CPUs are going offline/online. Old tasks won't update their p->nr_cpus_allowed I guess, but worst case they may miss some NUMA/LLC optimizations. Maybe we can add a generation counter and rely on scx_hotplug_seq to handle this case in a more precise way (like updating a local cpumask), but it seems a bit overkill... 
About node_state(nid, N_CPU), I've done some tests and it doesn't seem to work well for this scenario: it correctly returns 0 in case of memory-only NUMA nodes, but for example if I start a VM with a single NUMA node and I assign all the CPUs to that node, node_state(nid, N_CPU) returns 1 (correctly), but in our case the node should be considered like a memory-only node, since it includes all the possible CPUs. I've also tried to rely on sd_numa (similar to sd_llc), but it also doesn't seem to work as expected (this might be a bug? I'll investigate separately), because if I start a VM with 2 NUMA nodes (assigning half of the CPUs to node 1 and the other half to node 2), sd_numa still reports all CPUs assigned to the same node. Instead, highest_flag_domain(cpu, SD_NUMA)->groups seems to work as expected, and since the logic is also based on sched_domain like the LLC one, I definitely prefer this approach, thanks for the suggestions! -Andrea > > ... > > + for_each_possible_cpu(cpu) { > > + struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu)); > > + > > + if (!sd) > > + continue; > > + cpus = sched_domain_span(sd); > > + if (cpumask_weight(cpus) < nr_cpu_ids) { > > Ditto. > > ... > > + /* > > + * Determine the scheduling domain only if the task is allowed to run > > + * on all CPUs. > > + * > > + * This is done primarily for efficiency, as it avoids the overhead of > > + * updating a cpumask every time we need to select an idle CPU (which > > + * can be costly in large SMP systems), but it also aligns logically: > > + * if a task's scheduling domain is restricted by user-space (through > > + * CPU affinity), the task will simply use the flat scheduling domain > > + * defined by user-space. > > + */ > > + if (p->nr_cpus_allowed == nr_cpu_ids) { > > Should compare against nr_possible_cpus. > > Thanks. > > -- > tejun
© 2016 - 2024 Red Hat, Inc.