[RFC PATCH v4 25/28] sched: Skip cache aware scheduling if the process has many active threads

Chen Yu posted 28 patches 1 month, 3 weeks ago
Posted by Chen Yu 1 month, 3 weeks ago
If the number of active threads in a process
exceeds the number of cores in the LLC (that is,
the number of CPUs in the LLC divided by the SMT
width), do not enable cache-aware scheduling,
because aggregating that many threads risks cache
contention within the preferred LLC.

Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/fair.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2577b4225c3f..4bf794f170cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1205,6 +1205,18 @@ static inline int pref_llc_idx(struct task_struct *p)
 	return llc_idx(p->preferred_llc);
 }
 
+static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
+{
+	int smt_nr = 1;
+
+#ifdef CONFIG_SCHED_SMT
+	if (sched_smt_active())
+		smt_nr = cpumask_weight(cpu_smt_mask(cpu));
+#endif
+
+	return ((mm->nr_running_avg * smt_nr) > per_cpu(sd_llc_size, cpu));
+}
+
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
 {
 	int pref_llc;
@@ -1350,7 +1362,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 * it's preferred state.
 	 */
 	if (epoch - READ_ONCE(mm->mm_sched_epoch) > sysctl_llc_old ||
-	    get_nr_threads(p) <= 1) {
+	    get_nr_threads(p) <= 1 ||
+	    exceed_llc_nr(mm, cpu_of(rq))) {
 		mm->mm_sched_cpu = -1;
 		pcpu_sched->occ = 0;
 	}
@@ -1430,6 +1443,11 @@ static void __no_profile task_cache_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
+	if (get_nr_threads(p) <= 1) {
+		mm->mm_sched_cpu = -1;
+		return;
+	}
+
 	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
 		return;
 
@@ -9095,6 +9113,10 @@ static __maybe_unused enum llc_mig_hint get_migrate_hint(int src_cpu, int dst_cp
 	if (cpu < 0)
 		return mig_allow;
 
+	/* Skip cache-aware load balancing for single-threaded or over-subscribed processes. */
+	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu))
+		return mig_allow;
+
 	if (cpus_share_cache(dst_cpu, cpu))
 		return _get_migrate_hint(src_cpu, dst_cpu,
 					 task_util(p), true);
-- 
2.25.1
Re: [RFC PATCH v4 25/28] sched: Skip cache aware scheduling if the process has many active threads
Posted by Tingyin Duan 1 month ago
Several different test cases with mysql and sysbench show that this patch causes about a 10% performance regression on my machine with 256 cores. Perf-top shows high overhead in exceed_llc_nr. Could you help address this problem?
Re: [RFC PATCH v4 25/28] sched: Skip cache aware scheduling if the process has many active threads
Posted by Chen, Yu C 1 month ago
On 9/2/2025 1:16 PM, Tingyin Duan wrote:
> Several different test cases with mysql and sysbench show that this patch
> causes about a 10% performance regression on my machine with 256 cores.
> Perf-top shows high overhead in exceed_llc_nr. Could you help address
> this problem?

Thanks for bringing this to public for discussion. As we synced offline, the
performance regression is likely to be caused by the cache contention 
introduced
by the [25/28] patch:

The 1st issue:
Multiple threads within the same process read mm_struct->nr_running_avg
while task_cache_work() modifies mm_struct->mm_sched_cpu from time to
time. Since mm_sched_cpu and nr_running_avg share a cacheline, each
update turns the cacheline into the "Modified" state, and the subsequent
reads trigger costly "HITM" events (a cache hit on a line held Modified
in another core's cache). We should move nr_running_avg and mm_sched_cpu
to different cachelines.

The 2nd issue:
If nr_running_avg remains consistently above the threshold in your test
case, exceed_llc_nr() will always return true. This causes frequent
writes of -1 to mm->mm_sched_cpu even when it is already -1, which
creates another source of cache contention for threads on other CPUs
that read mm->mm_sched_cpu. We should update the mm_struct's
mm_sched_cpu field only when the value has actually changed.

That is to say, the following patch should fix the regression:

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2cca039d6e4f..3c1c50134647 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1032,7 +1032,11 @@ struct mm_struct {
  		raw_spinlock_t mm_sched_lock;
  		unsigned long mm_sched_epoch;
  		int mm_sched_cpu;
-		u64 nr_running_avg;
+		/*
+		 * mm_sched_cpu and nr_running_avg are put into separate
+		 * cachelines to avoid cache contention.
+		 */
+		u64 nr_running_avg ____cacheline_aligned_in_smp;
  #endif

  #ifdef CONFIG_MMU
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 026013c826d9..4ef28db57a37 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1428,7 +1428,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
  	    get_nr_threads(p) <= 1 ||
  	    exceed_llc_nr(mm, cpu_of(rq)) ||
  	    exceed_llc_capacity(mm, cpu_of(rq))) {
-		mm->mm_sched_cpu = -1;
+		if (mm->mm_sched_cpu != -1)
+			mm->mm_sched_cpu = -1;
  		pcpu_sched->occ = 0;
  	}

-- 
2.25.1


With the above change, the regression I mentioned in the cover letter,
seen when running multiple instances of hackbench on AMD Milan, has
disappeared. A max latency improvement with sysbench+MariaDB is also
observed on Milan:
transactions per sec.: -0.72%
min latency: -0.00%
avg latency: -0.00%
max latency: +78.90%
95th percentile: -0.00%
events avg: -0.72%
events stddev: +50.72%

thanks,
Chenyu
Re: [RFC PATCH v4 25/28] sched: Skip cache aware scheduling if the process has many active threads
Posted by Duan Tingyin 1 month ago
After applying this patch, the performance regression is significantly improved. Thank you!