[RFC PATCH v4 07/28] sched: Add helper function to decide whether to allow cache aware scheduling

Chen Yu posted 28 patches 1 month, 3 weeks ago
Posted by Chen Yu 1 month, 3 weeks ago
From: Tim Chen <tim.c.chen@linux.intel.com>

Cache-aware scheduling is designed to aggregate threads into their
preferred LLC, either via the task wake up path or the load balancing
path. One side effect is that when the preferred LLC is saturated,
more threads will continue to be stacked on it, degrading the workload's
latency. A strategy is needed to prevent this aggregation from going
so far that the preferred LLC becomes overloaded.

Introduce helper function _get_migrate_hint() to implement the
LLC migration policy:

1) A task is aggregated to its preferred LLC if both the source and
   destination LLCs are not too busy (<50% utilization, tunable), or
   if the preferred LLC will not become too out of balance with the
   non-preferred LLC (>20% utilization difference, tunable, close to
   the imbalance_pct of the LLC domain).
2) Allow a task to be moved from the preferred LLC to the
   non-preferred one if the non-preferred LLC will not become so
   out of balance with the preferred LLC that it prompts an
   aggregation migration back later.  We are still experimenting
   with the aggregation and migration policy. Other possibilities
   are policies based on the LLC's load or its average number of
   running tasks.  Those could be tried out by tweaking
   _get_migrate_hint().

The function _get_migrate_hint() returns migration suggestions for
the upper-level functions.

Aggregation will tend to make utilization on the preferred LLC
higher than on the non-preferred one. The parameter
"sysctl_llc_aggr_imb" is the imbalance allowed. Even if it is set
to 0, we can still aggregate towards the preferred LLC as long as
the preferred LLC is not utilized more than the source LLC, so a
preference remains.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/debug.c |   4 ++
 kernel/sched/fair.c  | 110 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |   5 ++
 3 files changed, 118 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 557246880a7e..682fd91a42a0 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -532,6 +532,10 @@ static __init int sched_init_debug(void)
 	debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
 #endif
 
+#ifdef CONFIG_SCHED_CACHE
+	debugfs_create_u32("llc_aggr_cap", 0644, debugfs_sched, &sysctl_llc_aggr_cap);
+	debugfs_create_u32("llc_aggr_imb", 0644, debugfs_sched, &sysctl_llc_aggr_imb);
+#endif
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
 	debugfs_fair_server_init();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4f79b7652642..3128dbcf0a36 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8826,7 +8826,39 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 }
 
 #ifdef CONFIG_SCHED_CACHE
-static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle);
+static long __migrate_degrades_locality(struct task_struct *p,
+					int src_cpu, int dst_cpu,
+					bool idle);
+__read_mostly unsigned int sysctl_llc_aggr_cap       = 50;
+__read_mostly unsigned int sysctl_llc_aggr_imb       = 20;
+
+/*
+ * The margin used when comparing LLC utilization with CPU capacity.
+ * Parameter sysctl_llc_aggr_cap determines the LLC load level where
+ * active LLC aggregation is done.
+ * Derived from fits_capacity().
+ *
+ * (default: ~50%)
+ */
+#define fits_llc_capacity(util, max)	\
+	((util) * 100 < (max) * sysctl_llc_aggr_cap)
+
+/*
+ * The margin used when comparing utilization:
+ * is 'util1' noticeably greater than 'util2'?
+ * Derived from capacity_greater().
+ * Bias is in percentage.
+ */
+/* Allows dst util to be bigger than src util by up to bias percent */
+#define util_greater(util1, util2) \
+	((util1) * 100 > (util2) * (100 + sysctl_llc_aggr_imb))
+
+enum llc_mig_hint {
+	mig_allow = 0,
+	mig_ignore,
+	mig_forbid
+};
+
 
 /* expected to be protected by rcu_read_lock() */
 static bool get_llc_stats(int cpu, unsigned long *util,
@@ -8844,6 +8876,82 @@ static bool get_llc_stats(int cpu, unsigned long *util,
 	return true;
 }
 
+static enum llc_mig_hint _get_migrate_hint(int src_cpu, int dst_cpu,
+					   unsigned long tsk_util,
+					   bool to_pref)
+{
+	unsigned long src_util, dst_util, src_cap, dst_cap;
+
+	if (cpus_share_cache(src_cpu, dst_cpu))
+		return mig_allow;
+
+	if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
+	    !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
+		return mig_ignore;
+
+	if (!fits_llc_capacity(dst_util, dst_cap) &&
+	    !fits_llc_capacity(src_util, src_cap))
+		return mig_ignore;
+
+	src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
+	dst_util = dst_util + tsk_util;
+	if (to_pref) {
+		/*
+		 * sysctl_llc_aggr_imb is the imbalance allowed between
+		 * preferred LLC and non-preferred LLC.
+		 * Don't migrate if we will get preferred LLC too
+		 * heavily loaded and if the dest is much busier
+		 * than the src, in which case migration will
+		 * increase the imbalance too much.
+		 */
+		if (!fits_llc_capacity(dst_util, dst_cap) &&
+		    util_greater(dst_util, src_util))
+			return mig_forbid;
+	} else {
+		/*
+		 * Don't migrate if we will leave the preferred LLC
+		 * too idle, or if this migration would leave the
+		 * non-preferred LLC within sysctl_llc_aggr_imb percent
+		 * of the preferred LLC, prompting a migration
+		 * back to the preferred LLC later.
+		 */
+		if (fits_llc_capacity(src_util, src_cap) ||
+		    !util_greater(src_util, dst_util))
+			return mig_forbid;
+	}
+	return mig_allow;
+}
+
+/*
+ * Give a suggestion on whether task p should migrate from src_cpu to dst_cpu.
+ */
+static __maybe_unused enum llc_mig_hint get_migrate_hint(int src_cpu, int dst_cpu,
+							 struct task_struct *p)
+{
+	struct mm_struct *mm;
+	int cpu;
+
+	if (cpus_share_cache(src_cpu, dst_cpu))
+		return mig_allow;
+
+	mm = p->mm;
+	if (!mm)
+		return mig_allow;
+
+	cpu = mm->mm_sched_cpu;
+	if (cpu < 0)
+		return mig_allow;
+
+	if (cpus_share_cache(dst_cpu, cpu))
+		return _get_migrate_hint(src_cpu, dst_cpu,
+					 task_util(p), true);
+	else if (cpus_share_cache(src_cpu, cpu))
+		return _get_migrate_hint(src_cpu, dst_cpu,
+					 task_util(p), false);
+	else
+		return mig_allow;
+}
+
 static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 {
 	struct mm_struct *mm = p->mm;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f4ab45ecca86..83552aab74fb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2844,6 +2844,11 @@ extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_numa_balancing_hot_threshold;
 
+#ifdef CONFIG_SCHED_CACHE
+extern unsigned int sysctl_llc_aggr_cap;
+extern unsigned int sysctl_llc_aggr_imb;
+#endif
+
 #ifdef CONFIG_SCHED_HRTICK
 
 /*
-- 
2.25.1
Re: [RFC PATCH v4 07/28] sched: Add helper function to decide whether to allow cache aware scheduling
Posted by Peter Zijlstra 3 days, 14 hours ago
On Sat, Aug 09, 2025 at 01:03:10PM +0800, Chen Yu wrote:
> From: Tim Chen <tim.c.chen@linux.intel.com>
> 
> Cache-aware scheduling is designed to aggregate threads into their
> preferred LLC, either via the task wake up path or the load balancing
> path. One side effect is that when the preferred LLC is saturated,
> more threads will continue to be stacked on it, degrading the workload's
> latency. A strategy is needed to prevent this aggregation from going too
> far such that the preferred LLC is too overloaded.

So one of the ideas was to extend the preferred llc number to a mask.
Update the preferred mask with (nr_threads / llc_size) bits, indicating
that many top LLCs, sorted by occupancy.
Re: [RFC PATCH v4 07/28] sched: Add helper function to decide whether to allow cache aware scheduling
Posted by Chen, Yu C 2 days, 16 hours ago
On 10/1/2025 9:17 PM, Peter Zijlstra wrote:
> On Sat, Aug 09, 2025 at 01:03:10PM +0800, Chen Yu wrote:
>> From: Tim Chen <tim.c.chen@linux.intel.com>
>>
>> Cache-aware scheduling is designed to aggregate threads into their
>> preferred LLC, either via the task wake up path or the load balancing
>> path. One side effect is that when the preferred LLC is saturated,
>> more threads will continue to be stacked on it, degrading the workload's
>> latency. A strategy is needed to prevent this aggregation from going too
>> far such that the preferred LLC is too overloaded.
> 
> So one of the ideas was to extend the preferred llc number to a mask.
> Update the preferred mask with (nr_threads / llc_size) bits, indicating
> the that many top llc as sorted by occupancy.
> 
> 

Having more than one preferred LLC helps prevent aggregation from going
too far on a single preferred LLC.

One question would be: if one LLC cannot hold all the threads of a process,
does a second preferred LLC help in this use case? Currently, this patch
gives up task aggregation and falls back to legacy load balancing if the
preferred LLC is overloaded. If we place threads across two preferred LLCs,
these threads might encounter cross-LLC latency anyway - so we may as well
let legacy load balancing spread them out IMO.

Another issue that Patch 7 tries to address is avoiding task
bouncing between preferred LLCs and non-preferred LLCs. If we
introduce a preferred LLC priority list, logic to prevent task
bouncing between different preferred LLCs might be needed in
load balancing, which could become complicated. Currently, we
mainly implement cache-aware scheduling in load balancing rather
than during task wakeup, because the wakeup path conflicts with
the load balance path and causes task migration bouncing.


thanks,
Chenyu
Re: [RFC PATCH v4 07/28] sched: Add helper function to decide whether to allow cache aware scheduling
Posted by Peter Zijlstra 2 days, 15 hours ago
On Thu, Oct 02, 2025 at 07:31:40PM +0800, Chen, Yu C wrote:
> On 10/1/2025 9:17 PM, Peter Zijlstra wrote:
> > On Sat, Aug 09, 2025 at 01:03:10PM +0800, Chen Yu wrote:
> > > From: Tim Chen <tim.c.chen@linux.intel.com>
> > > 
> > > Cache-aware scheduling is designed to aggregate threads into their
> > > preferred LLC, either via the task wake up path or the load balancing
> > > path. One side effect is that when the preferred LLC is saturated,
> > > more threads will continue to be stacked on it, degrading the workload's
> > > latency. A strategy is needed to prevent this aggregation from going too
> > > far such that the preferred LLC is too overloaded.
> > 
> > So one of the ideas was to extend the preferred llc number to a mask.
> > Update the preferred mask with (nr_threads / llc_size) bits, indicating
> > the that many top llc as sorted by occupancy.
> > 
> > 
> 
> Having more than one preferred LLC helps prevent aggregation from going
> too far on a single preferred LLC.
> 
> One question would be: if one LLC cannot hold all the threads of a process,
> does a second preferred LLC help in this use case? Currently, this patch
> gives up task aggregation and falls back to legacy load balancing if the
> preferred LLC is overloaded. If we place threads across two preferred LLCs,
> these threads might encounter cross-LLC latency anyway - so we may as well
> let
> legacy load balancing spread them out IMO.

Well, being stuck on 2 LLCs instead of being spread across 10 still
seems like a win, no?

Remember, our friends at AMD have *MANY* LLCs.

> Another issue that Patch 7 tries to address is avoiding task
> bouncing between preferred LLCs and non-preferred LLCs. If we
> introduce a preferred LLC priority list, logic to prevent task
> bouncing between different preferred LLCs might be needed in
> load balancing, which could become complicated. 

It doesn't really become more difficult to tell a preferred LLC from
a non-preferred LLC with a mask. So why should things get more complicated?


Anyway, it was just one of the 'random' ideas I had kicking about.
Reality always ruins things, *shrug* :-)
Re: [RFC PATCH v4 07/28] sched: Add helper function to decide whether to allow cache aware scheduling
Posted by Tim Chen 2 days, 9 hours ago
On Thu, 2025-10-02 at 13:50 +0200, Peter Zijlstra wrote:
> On Thu, Oct 02, 2025 at 07:31:40PM +0800, Chen, Yu C wrote:
> > On 10/1/2025 9:17 PM, Peter Zijlstra wrote:
> > > On Sat, Aug 09, 2025 at 01:03:10PM +0800, Chen Yu wrote:
> > > > From: Tim Chen <tim.c.chen@linux.intel.com>
> > > > 
> > > > Cache-aware scheduling is designed to aggregate threads into their
> > > > preferred LLC, either via the task wake up path or the load balancing
> > > > path. One side effect is that when the preferred LLC is saturated,
> > > > more threads will continue to be stacked on it, degrading the workload's
> > > > latency. A strategy is needed to prevent this aggregation from going too
> > > > far such that the preferred LLC is too overloaded.
> > > 
> > > So one of the ideas was to extend the preferred llc number to a mask.
> > > Update the preferred mask with (nr_threads / llc_size) bits, indicating
> > > the that many top llc as sorted by occupancy.
> > > 
> > > 
> > 
> > Having more than one preferred LLC helps prevent aggregation from going
> > too far on a single preferred LLC.
> > 
> > One question would be: if one LLC cannot hold all the threads of a process,
> > does a second preferred LLC help in this use case? Currently, this patch
> > gives up task aggregation and falls back to legacy load balancing if the
> > preferred LLC is overloaded. If we place threads across two preferred LLCs,
> > these threads might encounter cross-LLC latency anyway - so we may as well
> > let
> > legacy load balancing spread them out IMO.
> 
> Well, being stuck on 2 LLCs instead of being spread across 10 still
> seems like a win, no?
> 
> Remember, our friends at AMD have *MANY* LLCs.
> 
> > Another issue that Patch 7 tries to address is avoiding task
> > bouncing between preferred LLCs and non-preferred LLCs. If we
> > introduce a preferred LLC priority list, logic to prevent task
> > bouncing between different preferred LLCs might be needed in
> > load balancing, which could become complicated. 
> 
> It doesn't really become more difficult to tell preferred LLC from
> non-preferred LLC with a asm. So why should things get more complicatd?
> 

For secondary and maybe tertiary LLCs to work well, the
ordering of the occupancy between the LLCs has to be
relatively stable. Otherwise we could have many
task migrations between the LLCs when the ordering changes.
Frequent task migrations could be worse for performance.

From previous experiments, we saw that the occupancy could
have some fairly big fluctuations.  That's the reason 
we set the preferred LLC threshold to be high (2x).
We want to be sure before jerking tasks around to a new LLC.

With secondary and tertiary LLCs, the LLC ordering would change
more frequently than with just a single preferred LLC.
The secondary and tertiary LLCs have fewer tasks per mm, so
their occupancy could fluctuate more.
One concern is that this could lead to extra task migrations
that negate any cache consolidation benefits gained.

Tim

> 
> Anyway, it was just one of the 'random' ideas I had kicking about.
> Reality always ruins things, *shrug* :-)
Re: [RFC PATCH v4 07/28] sched: Add helper function to decide whether to allow cache aware scheduling
Posted by Chen, Yu C 2 days, 14 hours ago
On 10/2/2025 7:50 PM, Peter Zijlstra wrote:
> On Thu, Oct 02, 2025 at 07:31:40PM +0800, Chen, Yu C wrote:
>> On 10/1/2025 9:17 PM, Peter Zijlstra wrote:
>>> On Sat, Aug 09, 2025 at 01:03:10PM +0800, Chen Yu wrote:
>>>> From: Tim Chen <tim.c.chen@linux.intel.com>
>>>>
>>>> Cache-aware scheduling is designed to aggregate threads into their
>>>> preferred LLC, either via the task wake up path or the load balancing
>>>> path. One side effect is that when the preferred LLC is saturated,
>>>> more threads will continue to be stacked on it, degrading the workload's
>>>> latency. A strategy is needed to prevent this aggregation from going too
>>>> far such that the preferred LLC is too overloaded.
>>>
>>> So one of the ideas was to extend the preferred llc number to a mask.
>>> Update the preferred mask with (nr_threads / llc_size) bits, indicating
>>> the that many top llc as sorted by occupancy.
>>>
>>>
>>
>> Having more than one preferred LLC helps prevent aggregation from going
>> too far on a single preferred LLC.
>>
>> One question would be: if one LLC cannot hold all the threads of a process,
>> does a second preferred LLC help in this use case? Currently, this patch
>> gives up task aggregation and falls back to legacy load balancing if the
>> preferred LLC is overloaded. If we place threads across two preferred LLCs,
>> these threads might encounter cross-LLC latency anyway - so we may as well
>> let
>> legacy load balancing spread them out IMO.
> 
> Well, being stuck on 2 LLCs instead of being spread across 10 still
> seems like a win, no?
> 
> Remember, our friends at AMD have *MANY* LLCs.
> 

I see, this makes sense.

>> Another issue that Patch 7 tries to address is avoiding task
>> bouncing between preferred LLCs and non-preferred LLCs. If we
>> introduce a preferred LLC priority list, logic to prevent task
>> bouncing between different preferred LLCs might be needed in
>> load balancing, which could become complicated.
> 
> It doesn't really become more difficult to tell preferred LLC from
> non-preferred LLC with a asm. So why should things get more complicatd?
> 

Besides distinguishing between preferred LLCs and non-preferred LLCs,
we might also want to distinguish between the ith preferred LLC and
the jth preferred LLC; applying hysteresis there would help avoid
tasks bouncing between them.

> 
> Anyway, it was just one of the 'random' ideas I had kicking about.
> Reality always ruins things, *shrug* :-)

Yes, multiple preferred LLCs is a promising direction if the data
shows good results. We are planning to clean up the RFC patch and
send a refreshed version to summarize the current status. Meanwhile,
we will evaluate the multi-preferred LLC approach internally.
Thanks for providing this idea.

thanks,
Chenyu