[PATCH v3] sched_ext: Introduce LLC awareness to the default idle selection policy

Andrea Righi posted 1 patch 1 month ago
kernel/sched/ext.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 60 insertions(+)
[PATCH v3] sched_ext: Introduce LLC awareness to the default idle selection policy
Posted by Andrea Righi 1 month ago
Rely on the scheduler topology information to implement basic LLC
awareness in the sched_ext build-in idle selection policy.

This allows schedulers using the built-in policy to make more informed
decisions when selecting an idle CPU in systems with multiple LLCs, such
as NUMA systems or chiplet-based architectures, and it helps keep tasks
within the same LLC domain, thereby improving cache locality.

For efficiency, LLC awareness is applied only to tasks that can run on
all the CPUs in the system for now. If a task's affinity is modified
from user space, it's the responsibility of user space to choose the
appropriate optimized scheduling domain.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

ChangeLog v2 -> v3:
  - skip LLC-aware idle picking if not all CPUs are allowed by the task
    and get rid of the expensive cpumask ops
  - add a comment to clarify that ops.select_cpu() is never called with
    per-CPU tasks

ChangeLog v1 -> v2:
  - get rid of expensive cpumask_copy()
  - depend on CONFIG_SCHED_MC (there is no point enabling llc awareness
    if the kernel doesn't keep track of llc information)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index a13a6461a290..34a20cdcec5d 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3119,9 +3119,39 @@ static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
 		goto retry;
 }
 
+#ifdef CONFIG_SCHED_MC
+/*
+ * Return the cpumask of CPUs usable by task @p in the same LLC domain of @cpu,
+ * or NULL if the LLC domain cannot be determined.
+ */
+static const struct cpumask *llc_domain(const struct task_struct *p, s32 cpu)
+{
+	struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu));
+	const struct cpumask *llc_cpus = sd ? sched_domain_span(sd) : NULL;
+
+	/*
+	 * Return the LLC domain only if the task is allowed to run on all
+	 * CPUs.
+	 */
+	return p->nr_cpus_allowed == nr_cpu_ids ? llc_cpus : NULL;
+}
+#else /* CONFIG_SCHED_MC */
+static inline const struct cpumask *llc_domain(struct task_struct *p, s32 cpu)
+{
+	return NULL;
+}
+#endif /* CONFIG_SCHED_MC */
+
+/*
+ * Built-in cpu idle selection policy.
+ *
+ * NOTE: tasks that can only run on 1 CPU are excluded by this logic, because
+ * we never call ops.select_cpu() for them, see select_task_rq().
+ */
 static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 			      u64 wake_flags, bool *found)
 {
+	const struct cpumask *llc_cpus = llc_domain(p, prev_cpu);
 	s32 cpu;
 
 	*found = false;
@@ -3173,22 +3203,52 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
 	 * partially idle @prev_cpu.
 	 */
 	if (sched_smt_active()) {
+		/*
+		 * Keep using @prev_cpu if it's part of a fully idle core.
+		 */
 		if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
 		    test_and_clear_cpu_idle(prev_cpu)) {
 			cpu = prev_cpu;
 			goto cpu_found;
 		}
 
+		/*
+		 * Search for any fully idle core in the same LLC domain.
+		 */
+		if (llc_cpus) {
+			cpu = scx_pick_idle_cpu(llc_cpus, SCX_PICK_IDLE_CORE);
+			if (cpu >= 0)
+				goto cpu_found;
+		}
+
+		/*
+		 * Search for any full idle core usable by the task.
+		 */
 		cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
 		if (cpu >= 0)
 			goto cpu_found;
 	}
 
+	/*
+	 * Use @prev_cpu if it's idle.
+	 */
 	if (test_and_clear_cpu_idle(prev_cpu)) {
 		cpu = prev_cpu;
 		goto cpu_found;
 	}
 
+	/*
+	 * Search for any idle CPU in the same LLC domain.
+	 */
+	if (llc_cpus) {
+		cpu = scx_pick_idle_cpu(llc_cpus, 0);
+		if (cpu >= 0)
+			goto cpu_found;
+	}
+
+	/*
+	 * Search for any idle CPU usable by the task.
+	 */
 	cpu = scx_pick_idle_cpu(p->cpus_ptr, 0);
 	if (cpu >= 0)
 		goto cpu_found;
-- 
2.47.0
Re: [PATCH v3] sched_ext: Introduce LLC awareness to the default idle selection policy
Posted by Tejun Heo 1 month ago
On Wed, Oct 23, 2024 at 01:47:18AM +0200, Andrea Righi wrote:
> Rely on the scheduler topology information to implement basic LLC
> awareness in the sched_ext build-in idle selection policy.
> 
> This allows schedulers using the built-in policy to make more informed
> decisions when selecting an idle CPU in systems with multiple LLCs, such
> as NUMA systems or chiplet-based architectures, and it helps keep tasks
> within the same LLC domain, thereby improving cache locality.
> 
> For efficiency, LLC awareness is applied only to tasks that can run on
> all the CPUs in the system for now. If a task's affinity is modified
> from user space, it's the responsibility of user space to choose the
> appropriate optimized scheduling domain.
> 
> Signed-off-by: Andrea Righi <arighi@nvidia.com>

Applied to sched_ext/for-6.13.

Thanks.

-- 
tejun