[PATCH] sched_ext: Introduce NUMA awareness to the default idle selection policy
Posted by Andrea Righi 1 month ago
Similarly to commit dfa4ed29b18c ("sched_ext: Introduce LLC awareness to
the default idle selection policy"), extend the built-in idle CPU
selection policy to also prioritize CPUs within the same NUMA node.

With this change applied, the built-in CPU idle selection policy follows
this logic:
 - always prioritize CPUs from fully idle SMT cores,
 - select the same CPU if possible,
 - select a CPU within the same LLC domain,
 - select a CPU within the same NUMA node.

Note that LLC and NUMA awareness optimizations are only applied when
CONFIG_SCHED_MC is enabled.
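
For illustration, the resulting order can be summarized with the
following sketch (simplified; each step falls through to the next one
when no idle CPU is found):

	if (SMT is active) {
		use prev_cpu if its whole core is idle;
		cpu = scx_pick_idle_cpu(llc_cpus, SCX_PICK_IDLE_CORE);
		cpu = scx_pick_idle_cpu(numa_cpus, SCX_PICK_IDLE_CORE);
		cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
	}
	use prev_cpu if it is idle;
	cpu = scx_pick_idle_cpu(llc_cpus, 0);
	cpu = scx_pick_idle_cpu(numa_cpus, 0);
	cpu = scx_pick_idle_cpu(p->cpus_ptr, 0);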

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext.c | 97 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 85 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index d7ae816db6f2..cdc6094893db 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3124,31 +3124,85 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
 		goto retry;
 }
 
+/*
+ * Scheduling domain types used for idle CPU selection.
+ */
+enum scx_domain_type {
+	SCX_DOM_LLC,		/* Use the same last-level cache (LLC) */
+	SCX_DOM_NUMA,		/* Use the same NUMA node */
+};
+
 #ifdef CONFIG_SCHED_MC
 /*
- * Return the cpumask of CPUs usable by task @p in the same LLC domain of @cpu,
- * or NULL if the LLC domain cannot be determined.
+ * Return the cpumask of CPUs usable by task @p in the same domain of @cpu, or
+ * NULL if the domain cannot be determined.
  */
-static const struct cpumask *llc_domain(const struct task_struct *p, s32 cpu)
+static const struct cpumask *
+scx_domain(const struct task_struct *p, s32 cpu, enum scx_domain_type type)
 {
-	struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu));
-	const struct cpumask *llc_cpus = sd ? sched_domain_span(sd) : NULL;
+	struct sched_domain *sd;
 
 	/*
-	 * Return the LLC domain only if the task is allowed to run on all
-	 * CPUs.
-	 */
-	return p->nr_cpus_allowed == nr_cpu_ids ? llc_cpus : NULL;
+	 * Determine the scheduling domain only if the task is allowed to run
+	 * on all CPUs.
+	 *
+	 * This is done primarily for efficiency, as it avoids the overhead of
+	 * updating a cpumask every time we need to select an idle CPU (which
+	 * can be costly in large SMP systems), but it also aligns logically:
+	 * if a task's scheduling domain is restricted by user-space (through
+	 * CPU affinity), the task will simply use the flat scheduling domain
+	 * defined by user-space.
+	 */
+	if (p->nr_cpus_allowed < nr_cpu_ids)
+		return NULL;
+
+	switch (type) {
+	case SCX_DOM_LLC:
+		sd = rcu_dereference(per_cpu(sd_llc, cpu));
+		break;
+	case SCX_DOM_NUMA:
+		sd = rcu_dereference(per_cpu(sd_numa, cpu));
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		sd = NULL;
+	}
+	if (!sd)
+		return NULL;
+
+	return sched_domain_span(sd);
 }
 #else /* CONFIG_SCHED_MC */
-static inline const struct cpumask *llc_domain(struct task_struct *p, s32 cpu)
+static const struct cpumask *
+scx_domain(const struct task_struct *p, s32 cpu, enum scx_domain_type type)
 {
 	return NULL;
 }
 #endif /* CONFIG_SCHED_MC */
 
 /*
- * Built-in cpu idle selection policy.
+ * Built-in CPU idle selection policy:
+ *
+ * 1. Prioritize full-idle cores:
+ *   - always prioritize CPUs from fully idle cores (both logical CPUs are
+ *     idle) to avoid interference caused by SMT.
+ *
+ * 2. Reuse the same CPU:
+ *   - prefer the last used CPU to take advantage of cached data (L1, L2) and
+ *     branch prediction optimizations.
+ *
+ * 3. Pick a CPU within the same LLC (Last-Level Cache):
+ *   - if the above conditions aren't met, pick a CPU that shares the same LLC
+ *     to maintain cache locality.
+ *
+ * 4. Pick a CPU within the same NUMA Node:
+ *   - choose a CPU from the same NUMA node to reduce memory access latency.
+ *
+ * In most architectures the NUMA domain and LLC domain overlap. In this case,
+ * making an additional attempt to find an idle CPU within the same domain
+ * might seem redundant, but the overhead is minimal and it can be beneficial,
+ * as it increases the chance of selecting a close CPU that may have just
+ * become idle.
  *
  * NOTE: tasks that can only run on 1 CPU are excluded by this logic, because
  * we never call ops.select_cpu() for them, see select_task_rq().
@@ -3156,7 +3210,8 @@ static inline const struct cpumask *llc_domain(struct task_struct *p, s32 cpu)
 static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 			      u64 wake_flags, bool *found)
 {
-	const struct cpumask *llc_cpus = llc_domain(p, prev_cpu);
+	const struct cpumask *llc_cpus = scx_domain(p, prev_cpu, SCX_DOM_LLC);
+	const struct cpumask *numa_cpus = scx_domain(p, prev_cpu, SCX_DOM_NUMA);
 	s32 cpu;
 
 	*found = false;
@@ -3226,6 +3281,15 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 				goto cpu_found;
 		}
 
+		/*
+		 * Search for any fully idle core in the same NUMA node.
+		 */
+		if (numa_cpus) {
+			cpu = scx_pick_idle_cpu(numa_cpus, SCX_PICK_IDLE_CORE);
+			if (cpu >= 0)
+				goto cpu_found;
+		}
+
 		/*
 		 * Search for any full idle core usable by the task.
 		 */
@@ -3251,6 +3315,15 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 			goto cpu_found;
 	}
 
+	/*
+	 * Search for any idle CPU in the same NUMA node.
+	 */
+	if (numa_cpus) {
+		cpu = scx_pick_idle_cpu(numa_cpus, 0);
+		if (cpu >= 0)
+			goto cpu_found;
+	}
+
 	/*
 	 * Search for any idle CPU usable by the task.
 	 */
-- 
2.47.0
Re: [PATCH] sched_ext: Introduce NUMA awareness to the default idle selection policy
Posted by Tejun Heo 1 month ago
Hello, Andrea.

On Thu, Oct 24, 2024 at 10:36:15AM +0200, Andrea Righi wrote:
> Similarly to commit dfa4ed29b18c ("sched_ext: Introduce LLC awareness to
> the default idle selection policy"), extend the built-in idle CPU
> selection policy to also prioritize CPUs within the same NUMA node.
> 
> With this change applied, the built-in CPU idle selection policy follows
> this logic:
>  - always prioritize CPUs from fully idle SMT cores,
>  - select the same CPU if possible,
>  - select a CPU within the same LLC domain,
>  - select a CPU within the same NUMA node.
> 
> Note that LLC and NUMA awareness optimizations are only applied when
> CONFIG_SCHED_MC is enabled.

The idea generally looks good to me but I have a couple quibbles about the
implementation.

...
> +enum scx_domain_type {
> +	SCX_DOM_LLC,		/* Use the same last-level cache (LLC) */
> +	SCX_DOM_NUMA,		/* Use the same NUMA node */
> +};
> +
>  #ifdef CONFIG_SCHED_MC
>  /*
> - * Return the cpumask of CPUs usable by task @p in the same LLC domain of @cpu,
> - * or NULL if the LLC domain cannot be determined.
> + * Return the cpumask of CPUs usable by task @p in the same domain of @cpu, or
> + * NULL if the domain cannot be determined.
>   */
> -static const struct cpumask *llc_domain(const struct task_struct *p, s32 cpu)
> +static const struct cpumask *
> +scx_domain(const struct task_struct *p, s32 cpu, enum scx_domain_type type)
>  {
> -	struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu));
> -	const struct cpumask *llc_cpus = sd ? sched_domain_span(sd) : NULL;
> +	struct sched_domain *sd;
>  
>  	/*
> -	 * Return the LLC domain only if the task is allowed to run on all
> -	 * CPUs.
> -	 */
> -	return p->nr_cpus_allowed == nr_cpu_ids ? llc_cpus : NULL;
> +	 * Determine the scheduling domain only if the task is allowed to run
> +	 * on all CPUs.
> +	 *
> +	 * This is done primarily for efficiency, as it avoids the overhead of
> +	 * updating a cpumask every time we need to select an idle CPU (which
> +	 * can be costly in large SMP systems), but it also aligns logically:
> +	 * if a task's scheduling domain is restricted by user-space (through
> +	 * CPU affinity), the task will simply use the flat scheduling domain
> +	 * defined by user-space.
> +	 */
> +	if (p->nr_cpus_allowed < nr_cpu_ids)
> +		return NULL;
> +
> +	switch (type) {
> +	case SCX_DOM_LLC:
> +		sd = rcu_dereference(per_cpu(sd_llc, cpu));
> +		break;
> +	case SCX_DOM_NUMA:
> +		sd = rcu_dereference(per_cpu(sd_numa, cpu));
> +		break;
> +	default:
> +		WARN_ON_ONCE(1);
> +		sd = NULL;
> +	}
> +	if (!sd)
> +		return NULL;
> +
> +	return sched_domain_span(sd);
>  }
>  #else /* CONFIG_SCHED_MC */
> -static inline const struct cpumask *llc_domain(struct task_struct *p, s32 cpu)
> +static const struct cpumask *
> +scx_domain(const struct task_struct *p, s32 cpu, enum scx_domain_type type)
>  {
>  	return NULL;
>  }
>  #endif /* CONFIG_SCHED_MC */
...
> @@ -3156,7 +3210,8 @@ static inline const struct cpumask *llc_domain(struct task_struct *p, s32 cpu)
>  static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>  			      u64 wake_flags, bool *found)
>  {
> -	const struct cpumask *llc_cpus = llc_domain(p, prev_cpu);
> +	const struct cpumask *llc_cpus = scx_domain(p, prev_cpu, SCX_DOM_LLC);
> +	const struct cpumask *numa_cpus = scx_domain(p, prev_cpu, SCX_DOM_NUMA);

This feels like a lot of code which can just be:

	const struct cpumask *llc_cpus = NULL, *numa_cpus = NULL;

#ifdef CONFIG_SCHED_MC
	llc_cpus = rcu_dereference(per_cpu(sd_llc, cpu));
	numa_cpus = rcu_dereference(per_cpu(sd_numa, cpu));
#endif

>  	s32 cpu;
>  
>  	*found = false;
> @@ -3226,6 +3281,15 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
>  				goto cpu_found;
>  		}
>  
> +		/*
> +		 * Search for any fully idle core in the same NUMA node.
> +		 */
> +		if (numa_cpus) {
> +			cpu = scx_pick_idle_cpu(numa_cpus, SCX_PICK_IDLE_CORE);
> +			if (cpu >= 0)
> +				goto cpu_found;
> +		}

I'm not convinced about the argument that always doing the extra pick is
beneficial. Sure, the overhead is minimal but isn't it also trivial to avoid
by just testing llc_cpus == numa_cpus (they resolve to the same cpumasks on
non-NUMA machines, right)? Taking a step further, the topology information
is really static and can be determined during boot. Wouldn't it make more
sense to just skip the unnecessary steps depending on topology? I'm not sure
the difference would be measurable but if you care we can make them
static_keys too.

Thanks.

-- 
tejun
Re: [PATCH] sched_ext: Introduce NUMA awareness to the default idle selection policy
Posted by Andrea Righi 1 month ago
On Thu, Oct 24, 2024 at 09:15:58AM -1000, Tejun Heo wrote:
...
> > @@ -3156,7 +3210,8 @@ static inline const struct cpumask *llc_domain(struct task_struct *p, s32 cpu)
> >  static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> >                             u64 wake_flags, bool *found)
> >  {
> > -     const struct cpumask *llc_cpus = llc_domain(p, prev_cpu);
> > +     const struct cpumask *llc_cpus = scx_domain(p, prev_cpu, SCX_DOM_LLC);
> > +     const struct cpumask *numa_cpus = scx_domain(p, prev_cpu, SCX_DOM_NUMA);
> 
> This feels like a lot of code which can just be:
> 
>         const struct cpumask *llc_cpus = NULL, *numa_cpus = NULL;
> 
> #ifdef CONFIG_SCHED_MC
>         llc_cpus = rcu_dereference(per_cpu(sd_llc, cpu));
>         numa_cpus = rcu_dereference(per_cpu(sd_numa, cpu));
> #endif
> 

Yeah, I can definitely simplify this part and get rid of some
boilerplate.
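
Something along these lines (untested) - we'd still need the affinity
check and the sched_domain_span() conversion, since sd_llc / sd_numa
are per-CPU struct sched_domain pointers rather than cpumasks:

	const struct cpumask *llc_cpus = NULL, *numa_cpus = NULL;

#ifdef CONFIG_SCHED_MC
	if (p->nr_cpus_allowed >= nr_cpu_ids) {
		struct sched_domain *sd;

		sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
		if (sd)
			llc_cpus = sched_domain_span(sd);
		sd = rcu_dereference(per_cpu(sd_numa, prev_cpu));
		if (sd)
			numa_cpus = sched_domain_span(sd);
	}
#endif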

> >       s32 cpu;
> >
> >       *found = false;
> > @@ -3226,6 +3281,15 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> >                               goto cpu_found;
> >               }
> >
> > +             /*
> > +              * Search for any fully idle core in the same NUMA node.
> > +              */
> > +             if (numa_cpus) {
> > +                     cpu = scx_pick_idle_cpu(numa_cpus, SCX_PICK_IDLE_CORE);
> > +                     if (cpu >= 0)
> > +                             goto cpu_found;
> > +             }
> 
> I'm not convinced about the argument that always doing extra pick is
> beneficial. Sure, the overhead is minimal but isn't it also trivial to avoid
> by just testing llc_cpus == numa_cpus (they resolve to the same cpumasks on
> non-NUMA machines, right)? Taking a step further, the topology information
> is really static and can be determined during boot. Wouldn't it make more
> sense to just skip the unnecessary steps depending on topology? I'm not sure
> the difference would be measurable but if you care we can make them
> static_keys too.

Right, on non-NUMA machines llc_cpus and numa_cpus both resolve to the
same CPUs. Also, on systems with a single shared LLC, llc_cpus and
numa_cpus resolve to all CPUs, so in this case we can skip both steps.
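
The skip itself should just be a cpumask_equal() test (the two spans
come from different sched_domain objects, so comparing the pointers
wouldn't catch it), something like:

	/* skip the NUMA passes when they span the same CPUs as the LLC */
	if (llc_cpus && numa_cpus && cpumask_equal(llc_cpus, numa_cpus))
		numa_cpus = NULL;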

Maybe using static_keys is the best option; that way we can be sure we
won't add any overhead on non-NUMA / single-LLC systems compared to the
previous scx_select_cpu_dfl() implementation.

I'll do some tests and send a v2.

Thanks for looking at this!
-Andrea