[RFC patch v3 07/20] sched: Add helper function to decide whether to allow cache aware scheduling

Posted by Tim Chen 3 months, 3 weeks ago
Cache-aware scheduling is designed to aggregate threads into their
preferred LLC, either via the task wake up path or the load balancing
path. One side effect is that when the preferred LLC is saturated,
more threads will continue to be stacked on it, degrading the workload's
latency. A strategy is needed to keep this aggregation from going so
far that the preferred LLC becomes overloaded.

Introduce helper function _get_migrate_hint() to implement the LLC
migration policy:

1) A task is aggregated into its preferred LLC if both the source and
   destination LLCs are not too busy (<50% utilization, tunable), or
   if the preferred LLC will not become too imbalanced relative to
   the non-preferred LLC (>20% utilization, tunable, close to the
   imbalance_pct of the LLC domain).
2) A task is allowed to move from the preferred LLC to the
   non-preferred one if doing so does not bring the non-preferred LLC
   so close to the preferred one that it would prompt an aggregation
   migration back later.  We are still experimenting with the
   aggregation and migration policy.  Other possibilities are policies
   based on the LLC's load or average number of running tasks; those
   could be tried out by tweaking _get_migrate_hint().

_get_migrate_hint() returns a migration hint to the upper-level callers.
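
For reference, the resulting policy can be exercised outside the kernel.
The sketch below is a plain userspace mirror of the macros and of the
_get_migrate_hint() logic introduced by this patch, with the two tunables
hard-coded to their defaults and the kernel plumbing (get_llc_stats(),
RCU, the cpus_share_cache() shortcuts) dropped; it only prints the hint
for a few example utilization/capacity figures:

  #include <stdbool.h>
  #include <stdio.h>

  #define LLC_AGGR_CAP  50  /* default sysctl_llc_aggr_cap: ~50% of LLC capacity */
  #define LLC_AGGR_IMB  20  /* default sysctl_llc_aggr_imb: 20% allowed imbalance */

  /* util stays below LLC_AGGR_CAP percent of the LLC's capacity */
  static bool fits_llc_capacity(unsigned long util, unsigned long max)
  {
      return util * 100 < max * LLC_AGGR_CAP;
  }

  /* util1 exceeds util2 by more than LLC_AGGR_IMB percent */
  static bool util_greater(unsigned long util1, unsigned long util2)
  {
      return util1 * 100 > util2 * (100 + LLC_AGGR_IMB);
  }

  enum llc_mig_hint { mig_allow, mig_ignore, mig_forbid };

  /*
   * Mirror of the decision logic in _get_migrate_hint(), with the LLC
   * statistics passed in directly instead of read via get_llc_stats().
   */
  static enum llc_mig_hint migrate_hint(unsigned long src_util, unsigned long src_cap,
                                        unsigned long dst_util, unsigned long dst_cap,
                                        unsigned long tsk_util, bool to_pref)
  {
      /* Both LLCs already saturated: aggregation will not help */
      if (!fits_llc_capacity(dst_util, dst_cap) &&
          !fits_llc_capacity(src_util, src_cap))
          return mig_ignore;

      /* Evaluate the post-migration picture */
      src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
      dst_util += tsk_util;

      if (to_pref) {
          /* Migrating into the preferred LLC */
          if (!fits_llc_capacity(dst_util, dst_cap) &&
              util_greater(dst_util, src_util))
              return mig_forbid;
      } else {
          /* Migrating out of the preferred LLC */
          if (fits_llc_capacity(src_util, src_cap) ||
              !util_greater(src_util, dst_util))
              return mig_forbid;
      }
      return mig_allow;
  }

  int main(void)
  {
      static const char * const name[] = { "mig_allow", "mig_ignore", "mig_forbid" };

      /* Both LLCs well below 50% of capacity: aggregation is allowed */
      printf("to preferred, both lightly loaded: %s\n",
             name[migrate_hint(300, 1024, 300, 1024, 100, true)]);

      /* Preferred LLC would end up above 50% and >20% busier than the source */
      printf("to preferred, destination saturated: %s\n",
             name[migrate_hint(200, 1024, 700, 1024, 100, true)]);

      /* Pulling a task out of a preferred LLC that is still below 50% */
      printf("from preferred, source lightly loaded: %s\n",
             name[migrate_hint(300, 1024, 200, 1024, 100, false)]);

      return 0;
  }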

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/debug.c |   4 ++
 kernel/sched/fair.c  | 110 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |   5 ++
 3 files changed, 118 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 56ae54e0ce6a..7271ad1152af 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -532,6 +532,10 @@ static __init int sched_init_debug(void)
 	debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
 #endif
 
+#ifdef CONFIG_SCHED_CACHE
+	debugfs_create_u32("llc_aggr_cap", 0644, debugfs_sched, &sysctl_llc_aggr_cap);
+	debugfs_create_u32("llc_aggr_imb", 0644, debugfs_sched, &sysctl_llc_aggr_imb);
+#endif
 	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
 
 	debugfs_fair_server_init();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f104414b9a..10ea408d0e40 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8804,7 +8804,39 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 }
 
 #ifdef CONFIG_SCHED_CACHE
-static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle);
+static long __migrate_degrades_locality(struct task_struct *p,
+					int src_cpu, int dst_cpu,
+					bool idle);
+__read_mostly unsigned int sysctl_llc_aggr_cap       = 50;
+__read_mostly unsigned int sysctl_llc_aggr_imb       = 20;
+
+/*
+ * The margin used when comparing LLC utilization with CPU capacity.
+ * Parameter sysctl_llc_aggr_cap determines the LLC load level where
+ * active LLC aggregation is done.
+ * Derived from fits_capacity().
+ *
+ * (default: ~50%)
+ */
+#define fits_llc_capacity(util, max)	\
+	((util) * 100 < (max) * sysctl_llc_aggr_cap)
+
+/*
+ * The margin used when comparing utilization:
+ * is 'util1' noticeably greater than 'util2'?
+ * Derived from capacity_greater().
+ * The bias is in percent.
+ */
+/* Allow dst util to exceed src util by up to sysctl_llc_aggr_imb percent */
+#define util_greater(util1, util2) \
+	((util1) * 100 > (util2) * (100 + sysctl_llc_aggr_imb))
+
+enum llc_mig_hint {
+	mig_allow = 0,
+	mig_ignore,
+	mig_forbid
+};
+
 
 /* expected to be protected by rcu_read_lock() */
 static bool get_llc_stats(int cpu, unsigned long *util,
@@ -8822,6 +8854,82 @@ static bool get_llc_stats(int cpu, unsigned long *util,
 	return true;
 }
 
+static enum llc_mig_hint _get_migrate_hint(int src_cpu, int dst_cpu,
+					   unsigned long tsk_util,
+					   bool to_pref)
+{
+	unsigned long src_util, dst_util, src_cap, dst_cap;
+
+	if (cpus_share_cache(src_cpu, dst_cpu))
+		return mig_allow;
+
+	if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
+	    !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
+		return mig_allow;
+
+	if (!fits_llc_capacity(dst_util, dst_cap) &&
+	    !fits_llc_capacity(src_util, src_cap))
+		return mig_ignore;
+
+	src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
+	dst_util = dst_util + tsk_util;
+	if (to_pref) {
+		/*
+		 * sysctl_llc_aggr_imb is the imbalance allowed between
+		 * the preferred and non-preferred LLC.
+		 * Don't migrate if doing so would leave the preferred
+		 * LLC too heavily loaded and the dest much busier
+		 * than the src, in which case the migration would
+		 * increase the imbalance too much.
+		 */
+		if (!fits_llc_capacity(dst_util, dst_cap) &&
+		    util_greater(dst_util, src_util))
+			return mig_forbid;
+	} else {
+		/*
+		 * Don't migrate if we will leave the preferred LLC
+		 * too idle, or if this migration would leave the
+		 * non-preferred LLC within sysctl_llc_aggr_imb percent
+		 * of the preferred LLC, prompting a migration back
+		 * to the preferred LLC later.
+		 */
+		if (fits_llc_capacity(src_util, src_cap) ||
+		    !util_greater(src_util, dst_util))
+			return mig_forbid;
+	}
+	return mig_allow;
+}
+
+/*
+ * Give suggestion when task p is migrated from src_cpu to dst_cpu.
+ */
+static __maybe_unused enum llc_mig_hint get_migrate_hint(int src_cpu, int dst_cpu,
+							 struct task_struct *p)
+{
+	struct mm_struct *mm;
+	int cpu;
+
+	if (cpus_share_cache(src_cpu, dst_cpu))
+		return mig_allow;
+
+	mm = p->mm;
+	if (!mm)
+		return mig_allow;
+
+	cpu = mm->mm_sched_cpu;
+	if (cpu < 0)
+		return mig_allow;
+
+	if (cpus_share_cache(dst_cpu, cpu))
+		return _get_migrate_hint(src_cpu, dst_cpu,
+					 task_util(p), true);
+	else if (cpus_share_cache(src_cpu, cpu))
+		return _get_migrate_hint(src_cpu, dst_cpu,
+					 task_util(p), false);
+	else
+		return mig_allow;
+}
+
 static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 {
 	struct mm_struct *mm = p->mm;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d16ccd66ca07..1c6fd45c7f62 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2818,6 +2818,11 @@ extern unsigned int sysctl_numa_balancing_scan_period_max;
 extern unsigned int sysctl_numa_balancing_scan_size;
 extern unsigned int sysctl_numa_balancing_hot_threshold;
 
+#ifdef CONFIG_SCHED_CACHE
+extern unsigned int sysctl_llc_aggr_cap;
+extern unsigned int sysctl_llc_aggr_imb;
+#endif
+
 #ifdef CONFIG_SCHED_HRTICK
 
 /*
-- 
2.32.0
Re: [RFC patch v3 07/20] sched: Add helper function to decide whether to allow cache aware scheduling
Posted by Libo Chen 3 months ago
Hi Tim and Chenyu,


On 6/18/25 11:27, Tim Chen wrote:
> Cache-aware scheduling is designed to aggregate threads into their
> preferred LLC, either via the task wake up path or the load balancing
> path. One side effect is that when the preferred LLC is saturated,
> more threads will continue to be stacked on it, degrading the workload's
> latency. A strategy is needed to prevent this aggregation from going too
> far such that the preferred LLC is too overloaded.
> 
> Introduce helper function _get_migrate_hint() to implement the LLC
> migration policy:
> 
> 1) A task is aggregated to its preferred LLC if both source/dest LLC
>    are not too busy (<50% utilization, tunable), or the preferred
>    LLC will not be too out of balanced from the non preferred LLC
>    (>20% utilization, tunable, close to imbalance_pct of the LLC
>    domain).
> 2) Allow a task to be moved from the preferred LLC to the
>    non-preferred one if the non-preferred LLC will not be too out
>    of balanced from the preferred prompting an aggregation task
>    migration later.  We are still experimenting with the aggregation
>    and migration policy. Some other possibilities are policy based
>    on LLC's load or average number of tasks running.  Those could
>    be tried out by tweaking _get_migrate_hint().
> 
> The function _get_migrate_hint() returns migration suggestions for the upper-le
> +__read_mostly unsigned int sysctl_llc_aggr_cap       = 50;
> +__read_mostly unsigned int sysctl_llc_aggr_imb       = 20;
> +


I think this patch has a great potential.

Since _get_migrate_hint() is tied to an individual task anyway, why not add a
per-task llc_aggr_imb which defaults to the sysctl one? Tasks have different
preferences for LLC stacking, and they can all be running in the same system at
the same time. This way you can offer a greater degree of optimization without
much burden to others.

Also with sysctl_llc_aggr_imb, do we really need SCHED_CACHE_WAKE? Does setting
sysctl_llc_aggr_imb to 0 basically say no preference for either LLC, no?

Thanks,
Libo

> +static enum llc_mig_hint _get_migrate_hint(int src_cpu, int dst_cpu,
> +					   unsigned long tsk_util,
> +					   bool to_pref)
> +{
> +	unsigned long src_util, dst_util, src_cap, dst_cap;
> +
> +	if (cpus_share_cache(src_cpu, dst_cpu))
> +		return mig_allow;
> +
> +	if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
> +	    !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
> +		return mig_allow;
> +
> +	if (!fits_llc_capacity(dst_util, dst_cap) &&
> +	    !fits_llc_capacity(src_util, src_cap))
> +		return mig_ignore;
> +
> +	src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
> +	dst_util = dst_util + tsk_util;
> +	if (to_pref) {
> +		/*
> +		 * sysctl_llc_aggr_imb is the imbalance allowed between
> +		 * preferred LLC and non-preferred LLC.
> +		 * Don't migrate if we will get preferred LLC too
> +		 * heavily loaded and if the dest is much busier
> +		 * than the src, in which case migration will
> +		 * increase the imbalance too much.
> +		 */
> +		if (!fits_llc_capacity(dst_util, dst_cap) &&
> +		    util_greater(dst_util, src_util))
> +			return mig_forbid;
> +	} else {
> +		/*
> +		 * Don't migrate if we will leave preferred LLC
> +		 * too idle, or if this migration leads to the
> +		 * non-preferred LLC falls within sysctl_aggr_imb percent
> +		 * of preferred LLC, leading to migration again
> +		 * back to preferred LLC.
> +		 */
> +		if (fits_llc_capacity(src_util, src_cap) ||
> +		    !util_greater(src_util, dst_util))
> +			return mig_forbid;
> +	}
> +	return mig_allow;
> +}
Re: [RFC patch v3 07/20] sched: Add helper function to decide whether to allow cache aware scheduling
Posted by Tim Chen 3 months ago
On Mon, 2025-07-07 at 17:41 -0700, Libo Chen wrote:
> Hi Tim and Chenyu,
> 
> 
> On 6/18/25 11:27, Tim Chen wrote:
> > Cache-aware scheduling is designed to aggregate threads into their
> > preferred LLC, either via the task wake up path or the load balancing
> > path. One side effect is that when the preferred LLC is saturated,
> > more threads will continue to be stacked on it, degrading the workload's
> > latency. A strategy is needed to prevent this aggregation from going too
> > far such that the preferred LLC is too overloaded.
> > 
> > Introduce helper function _get_migrate_hint() to implement the LLC
> > migration policy:
> > 
> > 1) A task is aggregated to its preferred LLC if both source/dest LLC
> >    are not too busy (<50% utilization, tunable), or the preferred
> >    LLC will not be too out of balanced from the non preferred LLC
> >    (>20% utilization, tunable, close to imbalance_pct of the LLC
> >    domain).
> > 2) Allow a task to be moved from the preferred LLC to the
> >    non-preferred one if the non-preferred LLC will not be too out
> >    of balanced from the preferred prompting an aggregation task
> >    migration later.  We are still experimenting with the aggregation
> >    and migration policy. Some other possibilities are policy based
> >    on LLC's load or average number of tasks running.  Those could
> >    be tried out by tweaking _get_migrate_hint().
> > 
> > The function _get_migrate_hint() returns migration suggestions for the upper-le
> > +__read_mostly unsigned int sysctl_llc_aggr_cap       = 50;
> > +__read_mostly unsigned int sysctl_llc_aggr_imb       = 20;
> > +
> 
> 
> I think this patch has a great potential.
> 

Thanks for taking a look.

> Since _get_migrate_hint() is tied to an individual task anyway, why not add a
> per-task llc_aggr_imb which defaults to the sysctl one? 
> 

_get_migrate_hint() could also be called from llc_balance(). At that point
we decide whether to do llc_balance() without knowing which exact task we're
going to move, while still observing the migration policy of not causing too
much imbalance.  So it may not be strictly tied to a task in the current
implementation.

> Tasks have different
> preferences for llc stacking, they can all be running in the same system at the
> same time. This way you can offer a greater deal of optimization without much
> burden to others.

You're thinking of something like a prctl knob that will bias aggregation for
some process?  Wonder if Peter has some opinion on this.

> 
> Also with sysctl_llc_aggr_imb, do we really need SCHED_CACHE_WAKE?
> 

Actually we think that we can do without the SCHED_CACHE_WAKE feature and rely
only on load balance SCHED_CACHE_LB.  But still keeping

>  Does setting
> sysctl_llc_aggr_imb to 0 basically say no preference for either LLC, no?

Aggregation will tend to make utilization on the preferred LLC higher than
on the non-preferred one.  Parameter "sysctl_llc_aggr_imb" is the imbalance
allowed.  If we set this to 0, then as long as the preferred LLC is not
utilized more than the source LLC, we can still aggregate towards the
preferred LLC, so a preference is still there.
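For example, with sysctl_llc_aggr_imb set to 0, util_greater(dst_util, src_util)
in _get_migrate_hint() reduces to a plain dst_util > src_util check on the
post-migration utilizations, so migration into the preferred LLC is only
forbidden once that LLC would both exceed the sysctl_llc_aggr_cap threshold
and end up busier than the source.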

Tim

> 
> Thanks,
> Libo
> 
> > +static enum llc_mig_hint _get_migrate_hint(int src_cpu, int dst_cpu,
> > +					   unsigned long tsk_util,
> > +					   bool to_pref)
> > +{
> > +	unsigned long src_util, dst_util, src_cap, dst_cap;
> > +
> > +	if (cpus_share_cache(src_cpu, dst_cpu))
> > +		return mig_allow;
> > +
> > +	if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
> > +	    !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
> > +		return mig_allow;
> > +
> > +	if (!fits_llc_capacity(dst_util, dst_cap) &&
> > +	    !fits_llc_capacity(src_util, src_cap))
> > +		return mig_ignore;
> > +
> > +	src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
> > +	dst_util = dst_util + tsk_util;
> > +	if (to_pref) {
> > +		/*
> > +		 * sysctl_llc_aggr_imb is the imbalance allowed between
> > +		 * preferred LLC and non-preferred LLC.
> > +		 * Don't migrate if we will get preferred LLC too
> > +		 * heavily loaded and if the dest is much busier
> > +		 * than the src, in which case migration will
> > +		 * increase the imbalance too much.
> > +		 */
> > +		if (!fits_llc_capacity(dst_util, dst_cap) &&
> > +		    util_greater(dst_util, src_util))
> > +			return mig_forbid;
> > +	} else {
> > +		/*
> > +		 * Don't migrate if we will leave preferred LLC
> > +		 * too idle, or if this migration leads to the
> > +		 * non-preferred LLC falls within sysctl_aggr_imb percent
> > +		 * of preferred LLC, leading to migration again
> > +		 * back to preferred LLC.
> > +		 */
> > +		if (fits_llc_capacity(src_util, src_cap) ||
> > +		    !util_greater(src_util, dst_util))
> > +			return mig_forbid;
> > +	}
> > +	return mig_allow;
> > +}
> 
> 
Re: [RFC patch v3 07/20] sched: Add helper function to decide whether to allow cache aware scheduling
Posted by Libo Chen 3 months ago

On 7/8/25 14:59, Tim Chen wrote:
> On Mon, 2025-07-07 at 17:41 -0700, Libo Chen wrote:
>> Hi Tim and Chenyu,
>>
>>
>> On 6/18/25 11:27, Tim Chen wrote:
>>> Cache-aware scheduling is designed to aggregate threads into their
>>> preferred LLC, either via the task wake up path or the load balancing
>>> path. One side effect is that when the preferred LLC is saturated,
>>> more threads will continue to be stacked on it, degrading the workload's
>>> latency. A strategy is needed to prevent this aggregation from going too
>>> far such that the preferred LLC is too overloaded.
>>>
>>> Introduce helper function _get_migrate_hint() to implement the LLC
>>> migration policy:
>>>
>>> 1) A task is aggregated to its preferred LLC if both source/dest LLC
>>>    are not too busy (<50% utilization, tunable), or the preferred
>>>    LLC will not be too out of balanced from the non preferred LLC
>>>    (>20% utilization, tunable, close to imbalance_pct of the LLC
>>>    domain).
>>> 2) Allow a task to be moved from the preferred LLC to the
>>>    non-preferred one if the non-preferred LLC will not be too out
>>>    of balanced from the preferred prompting an aggregation task
>>>    migration later.  We are still experimenting with the aggregation
>>>    and migration policy. Some other possibilities are policy based
>>>    on LLC's load or average number of tasks running.  Those could
>>>    be tried out by tweaking _get_migrate_hint().
>>>
>>> The function _get_migrate_hint() returns migration suggestions for the upper-le
>>> +__read_mostly unsigned int sysctl_llc_aggr_cap       = 50;
>>> +__read_mostly unsigned int sysctl_llc_aggr_imb       = 20;
>>> +
>>
>>
>> I think this patch has a great potential.
>>
> 
> Thanks for taking a look.
> 
>> Since _get_migrate_hint() is tied to an individual task anyway, why not add a
>> per-task llc_aggr_imb which defaults to the sysctl one? 
>>
> 
> _get_migrate_hint() could also be called from llc_balance(). At that time
> we make a determination of whether we should do llc_balance() without knowing
> which exact task we're going to move, but still observe the migration policy
> that shouldn't cause too much imbalance.  So it may not be strictly tied to a task
> in the current implementation.
> 
Ah right, by setting task_util to 0

>> Tasks have different
>> preferences for llc stacking, they can all be running in the same system at the
>> same time. This way you can offer a greater deal of optimization without much
>> burden to others.
> 
> You're thinking of something like a prctl knob that will bias aggregation for
> some process?  Wonder if Peter has some opinion on this.
> 

Yes. I am sure he has hhh but we can wait until the global approach is good enough
like Chen Yu said.

>>
>> Also with sysctl_llc_aggr_imb, do we really need SCHED_CACHE_WAKE?
>>
> 
> Actually we think that we can do without SCHED_CACHE_WAKE feature and rely only
> on load balance SCHED_CACHE_LB.  But still keeping 
> 
>>  Does setting
>> sysctl_llc_aggr_imb to 0 basically say no preference for either LLC, no?
> 
> Aggregation will tend to make utilization on the preferred LLC to be more
> than the non-preferred one.  Parameter "sysctl_llc_aggr_imb" is the imbalance
> allowed.  If we set this to 0, as long as the preferred LLC is not utilized
> more than the source LLC, we could still aggregate towards the preferred LLC
> and a preference could still be there.  
> 

I see, I think I have better understanding of this now. Thanks!

Libo

> Tim
> 
>>
>> Thanks,
>> Libo
>>
>>> +static enum llc_mig_hint _get_migrate_hint(int src_cpu, int dst_cpu,
>>> +					   unsigned long tsk_util,
>>> +					   bool to_pref)
>>> +{
>>> +	unsigned long src_util, dst_util, src_cap, dst_cap;
>>> +
>>> +	if (cpus_share_cache(src_cpu, dst_cpu))
>>> +		return mig_allow;
>>> +
>>> +	if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
>>> +	    !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
>>> +		return mig_allow;
>>> +
>>> +	if (!fits_llc_capacity(dst_util, dst_cap) &&
>>> +	    !fits_llc_capacity(src_util, src_cap))
>>> +		return mig_ignore;
>>> +
>>> +	src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
>>> +	dst_util = dst_util + tsk_util;
>>> +	if (to_pref) {
>>> +		/*
>>> +		 * sysctl_llc_aggr_imb is the imbalance allowed between
>>> +		 * preferred LLC and non-preferred LLC.
>>> +		 * Don't migrate if we will get preferred LLC too
>>> +		 * heavily loaded and if the dest is much busier
>>> +		 * than the src, in which case migration will
>>> +		 * increase the imbalance too much.
>>> +		 */
>>> +		if (!fits_llc_capacity(dst_util, dst_cap) &&
>>> +		    util_greater(dst_util, src_util))
>>> +			return mig_forbid;
>>> +	} else {
>>> +		/*
>>> +		 * Don't migrate if we will leave preferred LLC
>>> +		 * too idle, or if this migration leads to the
>>> +		 * non-preferred LLC falls within sysctl_aggr_imb percent
>>> +		 * of preferred LLC, leading to migration again
>>> +		 * back to preferred LLC.
>>> +		 */
>>> +		if (fits_llc_capacity(src_util, src_cap) ||
>>> +		    !util_greater(src_util, dst_util))
>>> +			return mig_forbid;
>>> +	}
>>> +	return mig_allow;
>>> +}
>>
>>
> 

Re: [RFC patch v3 07/20] sched: Add helper function to decide whether to allow cache aware scheduling
Posted by Chen, Yu C 3 months ago
On 7/8/2025 8:41 AM, Libo Chen wrote:
> Hi Tim and Chenyu,
> 
> 
> On 6/18/25 11:27, Tim Chen wrote:
>> Cache-aware scheduling is designed to aggregate threads into their
>> preferred LLC, either via the task wake up path or the load balancing
>> path. One side effect is that when the preferred LLC is saturated,
>> more threads will continue to be stacked on it, degrading the workload's
>> latency. A strategy is needed to prevent this aggregation from going too
>> far such that the preferred LLC is too overloaded.
>>
>> Introduce helper function _get_migrate_hint() to implement the LLC
>> migration policy:
>>
>> 1) A task is aggregated to its preferred LLC if both source/dest LLC
>>     are not too busy (<50% utilization, tunable), or the preferred
>>     LLC will not be too out of balanced from the non preferred LLC
>>     (>20% utilization, tunable, close to imbalance_pct of the LLC
>>     domain).
>> 2) Allow a task to be moved from the preferred LLC to the
>>     non-preferred one if the non-preferred LLC will not be too out
>>     of balanced from the preferred prompting an aggregation task
>>     migration later.  We are still experimenting with the aggregation
>>     and migration policy. Some other possibilities are policy based
>>     on LLC's load or average number of tasks running.  Those could
>>     be tried out by tweaking _get_migrate_hint().
>>
>> The function _get_migrate_hint() returns migration suggestions for the upper-le
>> +__read_mostly unsigned int sysctl_llc_aggr_cap       = 50;
>> +__read_mostly unsigned int sysctl_llc_aggr_imb       = 20;
>> +
> 
> 
> I think this patch has a great potential.
> 
> Since _get_migrate_hint() is tied to an individual task anyway, why not add a
> per-task llc_aggr_imb which defaults to the sysctl one? Tasks have different
> preferences for llc stacking, they can all be running in the same system at the
> same time. This way you can offer a greater deal of optimization without much
> burden to others.

Yes, this is doable. It can be evaluated after the global generic strategy
has been verified to work, like NUMA balancing :)

> 
> Also with sysctl_llc_aggr_imb, do we really need SCHED_CACHE_WAKE? 

Do you mean the SCHED_CACHE_WAKE or SCHED_CACHE_LB?

> Does setting sysctl_llc_aggr_imb to 0 basically say no preference for either LLC, no?
> 

My understanding is that, if sysctl_llc_aggr_imb is 0, the task aggregation
might still consider other aspects, like if that target LLC's utilization has
exceeded 50% or not.

thanks,
Chenyu
> Thanks,
> Libo
> 
>> +static enum llc_mig_hint _get_migrate_hint(int src_cpu, int dst_cpu,
>> +					   unsigned long tsk_util,
>> +					   bool to_pref)
>> +{
>> +	unsigned long src_util, dst_util, src_cap, dst_cap;
>> +
>> +	if (cpus_share_cache(src_cpu, dst_cpu))
>> +		return mig_allow;
>> +
>> +	if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
>> +	    !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
>> +		return mig_allow;
>> +
>> +	if (!fits_llc_capacity(dst_util, dst_cap) &&
>> +	    !fits_llc_capacity(src_util, src_cap))
>> +		return mig_ignore;
>> +
>> +	src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
>> +	dst_util = dst_util + tsk_util;
>> +	if (to_pref) {
>> +		/*
>> +		 * sysctl_llc_aggr_imb is the imbalance allowed between
>> +		 * preferred LLC and non-preferred LLC.
>> +		 * Don't migrate if we will get preferred LLC too
>> +		 * heavily loaded and if the dest is much busier
>> +		 * than the src, in which case migration will
>> +		 * increase the imbalance too much.
>> +		 */
>> +		if (!fits_llc_capacity(dst_util, dst_cap) &&
>> +		    util_greater(dst_util, src_util))
>> +			return mig_forbid;
>> +	} else {
>> +		/*
>> +		 * Don't migrate if we will leave preferred LLC
>> +		 * too idle, or if this migration leads to the
>> +		 * non-preferred LLC falls within sysctl_aggr_imb percent
>> +		 * of preferred LLC, leading to migration again
>> +		 * back to preferred LLC.
>> +		 */
>> +		if (fits_llc_capacity(src_util, src_cap) ||
>> +		    !util_greater(src_util, dst_util))
>> +			return mig_forbid;
>> +	}
>> +	return mig_allow;
>> +}
> 
>
Re: [RFC patch v3 07/20] sched: Add helper function to decide whether to allow cache aware scheduling
Posted by Libo Chen 3 months ago

On 7/8/25 01:29, Chen, Yu C wrote:
> On 7/8/2025 8:41 AM, Libo Chen wrote:
>> Hi Tim and Chenyu,
>>
>>
>> On 6/18/25 11:27, Tim Chen wrote:
>>> Cache-aware scheduling is designed to aggregate threads into their
>>> preferred LLC, either via the task wake up path or the load balancing
>>> path. One side effect is that when the preferred LLC is saturated,
>>> more threads will continue to be stacked on it, degrading the workload's
>>> latency. A strategy is needed to prevent this aggregation from going too
>>> far such that the preferred LLC is too overloaded.
>>>
>>> Introduce helper function _get_migrate_hint() to implement the LLC
>>> migration policy:
>>>
>>> 1) A task is aggregated to its preferred LLC if both source/dest LLC
>>>     are not too busy (<50% utilization, tunable), or the preferred
>>>     LLC will not be too out of balanced from the non preferred LLC
>>>     (>20% utilization, tunable, close to imbalance_pct of the LLC
>>>     domain).
>>> 2) Allow a task to be moved from the preferred LLC to the
>>>     non-preferred one if the non-preferred LLC will not be too out
>>>     of balanced from the preferred prompting an aggregation task
>>>     migration later.  We are still experimenting with the aggregation
>>>     and migration policy. Some other possibilities are policy based
>>>     on LLC's load or average number of tasks running.  Those could
>>>     be tried out by tweaking _get_migrate_hint().
>>>
>>> The function _get_migrate_hint() returns migration suggestions for the upper-le
>>> +__read_mostly unsigned int sysctl_llc_aggr_cap       = 50;
>>> +__read_mostly unsigned int sysctl_llc_aggr_imb       = 20;
>>> +
>>
>>
>> I think this patch has a great potential.
>>
>> Since _get_migrate_hint() is tied to an individual task anyway, why not add a
>> per-task llc_aggr_imb which defaults to the sysctl one? Tasks have different
>> preferences for llc stacking, they can all be running in the same system at the
>> same time. This way you can offer a greater deal of optimization without much
>> burden to others.
> 
> Yes, this doable. It can be evaluated after the global generic strategy
> has been verified to work, like NUMA balancing :)
> 

I will run some real-world workloads and get back to you (may take some time)

>>
>> Also with sysctl_llc_aggr_imb, do we really need SCHED_CACHE_WAKE? 
> 
> Do you mean the SCHED_CACHE_WAKE or SCHED_CACHE_LB?
> 

Ah I was thinking sysctl_llc_aggr_imb alone can help reduce overstacking on the
target LLC from a few hyperactive wakees (we may consider ratelimiting those
wakees as a solution), but I just realized this can affect lb as well and doesn't
really reduce the overhead from frequent wakeups (no good idea off the top of my
head, but we should find a better solution than a sched_feat to address the
overhead issue).



>> Does setting sysctl_llc_aggr_imb to 0 basically say no preference for either LLC, no?
>>
> 
> My understanding is that, if sysctl_llc_aggr_imb is 0, the task aggregation
> might still consider other aspects, like if that target LLC's utilization has
> exceeded 50% or not.
> 

which can be controlled by sysctl_llc_aggr_cap, right? Okay so if both LLCs have
<$(sysctl_llc_aggr_cap)% utilization, should sysctl_llc_aggr_cap be the only
determining factor here barring NUMA balancing?

Libo

> thanks,
> Chenyu
>> Thanks,
>> Libo
>>
>>> +static enum llc_mig_hint _get_migrate_hint(int src_cpu, int dst_cpu,
>>> +                       unsigned long tsk_util,
>>> +                       bool to_pref)
>>> +{
>>> +    unsigned long src_util, dst_util, src_cap, dst_cap;
>>> +
>>> +    if (cpus_share_cache(src_cpu, dst_cpu))
>>> +        return mig_allow;
>>> +
>>> +    if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
>>> +        !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
>>> +        return mig_allow;
>>> +
>>> +    if (!fits_llc_capacity(dst_util, dst_cap) &&
>>> +        !fits_llc_capacity(src_util, src_cap))
>>> +        return mig_ignore;
>>> +
>>> +    src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
>>> +    dst_util = dst_util + tsk_util;
>>> +    if (to_pref) {
>>> +        /*
>>> +         * sysctl_llc_aggr_imb is the imbalance allowed between
>>> +         * preferred LLC and non-preferred LLC.
>>> +         * Don't migrate if we will get preferred LLC too
>>> +         * heavily loaded and if the dest is much busier
>>> +         * than the src, in which case migration will
>>> +         * increase the imbalance too much.
>>> +         */
>>> +        if (!fits_llc_capacity(dst_util, dst_cap) &&
>>> +            util_greater(dst_util, src_util))
>>> +            return mig_forbid;
>>> +    } else {
>>> +        /*
>>> +         * Don't migrate if we will leave preferred LLC
>>> +         * too idle, or if this migration leads to the
>>> +         * non-preferred LLC falls within sysctl_aggr_imb percent
>>> +         * of preferred LLC, leading to migration again
>>> +         * back to preferred LLC.
>>> +         */
>>> +        if (fits_llc_capacity(src_util, src_cap) ||
>>> +            !util_greater(src_util, dst_util))
>>> +            return mig_forbid;
>>> +    }
>>> +    return mig_allow;
>>> +}
>>
>>

Re: [RFC patch v3 07/20] sched: Add helper function to decide whether to allow cache aware scheduling
Posted by Chen, Yu C 3 months ago
On 7/9/2025 1:22 AM, Libo Chen wrote:
> 
> 
> On 7/8/25 01:29, Chen, Yu C wrote:
>> On 7/8/2025 8:41 AM, Libo Chen wrote:
>>> Hi Tim and Chenyu,
>>>
>>>
>>> On 6/18/25 11:27, Tim Chen wrote:
>>>> Cache-aware scheduling is designed to aggregate threads into their
>>>> preferred LLC, either via the task wake up path or the load balancing
>>>> path. One side effect is that when the preferred LLC is saturated,
>>>> more threads will continue to be stacked on it, degrading the workload's
>>>> latency. A strategy is needed to prevent this aggregation from going too
>>>> far such that the preferred LLC is too overloaded.
>>>>
>>>> Introduce helper function _get_migrate_hint() to implement the LLC
>>>> migration policy:
>>>>
>>>> 1) A task is aggregated to its preferred LLC if both source/dest LLC
>>>>      are not too busy (<50% utilization, tunable), or the preferred
>>>>      LLC will not be too out of balanced from the non preferred LLC
>>>>      (>20% utilization, tunable, close to imbalance_pct of the LLC
>>>>      domain).
>>>> 2) Allow a task to be moved from the preferred LLC to the
>>>>      non-preferred one if the non-preferred LLC will not be too out
>>>>      of balanced from the preferred prompting an aggregation task
>>>>      migration later.  We are still experimenting with the aggregation
>>>>      and migration policy. Some other possibilities are policy based
>>>>      on LLC's load or average number of tasks running.  Those could
>>>>      be tried out by tweaking _get_migrate_hint().
>>>>
>>>> The function _get_migrate_hint() returns migration suggestions for the upper-le
>>>> +__read_mostly unsigned int sysctl_llc_aggr_cap       = 50;
>>>> +__read_mostly unsigned int sysctl_llc_aggr_imb       = 20;
>>>> +
>>>
>>>
>>> I think this patch has a great potential.
>>>
>>> Since _get_migrate_hint() is tied to an individual task anyway, why not add a
>>> per-task llc_aggr_imb which defaults to the sysctl one? Tasks have different
>>> preferences for llc stacking, they can all be running in the same system at the
>>> same time. This way you can offer a greater deal of optimization without much
>>> burden to others.
>>
>> Yes, this doable. It can be evaluated after the global generic strategy
>> has been verified to work, like NUMA balancing :)
>>
> 
> I will run some real-world workloads and get back to you (may take some time)
> 

Thanks. It seems that there are pros and cons for different
workloads, and we are evaluating adding the RSS/active nr_running
per process to deal with different types of workloads.

>>>
>>> Also with sysctl_llc_aggr_imb, do we really need SCHED_CACHE_WAKE?
>>
>> Do you mean the SCHED_CACHE_WAKE or SCHED_CACHE_LB?
>>
> 
> Ah I was thinking sysctl_llc_aggr_imb alone can help reduce overstacking on
> target LLC from a few hyperactive wakees (may consider to ratelimit those
> wakees as a solution), but just realize this can affect lb as well and doesn't
> really reduce overheads from frequent wakeups (no good idea on top of my head
> but we should find a better solution than sched_feat to address the overhead issue).
> 
> 
> 
>>> Does setting sysctl_llc_aggr_imb to 0 basically say no preference for either LLC, no?
>>>
>>
>> My understanding is that, if sysctl_llc_aggr_imb is 0, the task aggregation
>> might still consider other aspects, like if that target LLC's utilization has
>> exceeded 50% or not.
>>
> 
> which can be controlled by sysctl_llc_aggr_cap, right? Okay so if both LLCs have
> <$(sysctl_llc_aggr_cap)% utilization, should sysctl_llc_aggr_cap be the only
> determining factor here barring NUMA balancing?
> 

If both LLCs are under (sysctl_llc_aggr_cap)%, then the strategy is still
to allow the task to be aggregated into its preferred LLC, either by asking
that the task not be pulled out of its preferred LLC, or by migrating the
task to its preferred LLC, in _get_migrate_hint().
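
Concretely, in that case the !to_pref branch of _get_migrate_hint() returns
mig_forbid because fits_llc_capacity() still holds for the preferred source,
so the task is not pulled out of its preferred LLC; in the to_pref direction
the move is allowed as long as the destination either stays under the cap or
does not exceed the source by more than sysctl_llc_aggr_imb percent after
the move.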

Thanks,
Chenyu
Re: [RFC patch v3 07/20] sched: Add helper function to decide whether to allow cache aware scheduling
Posted by Libo Chen 3 months ago

On 7/9/25 07:41, Chen, Yu C wrote:
> On 7/9/2025 1:22 AM, Libo Chen wrote:
>>
>>
>> On 7/8/25 01:29, Chen, Yu C wrote:
>>> On 7/8/2025 8:41 AM, Libo Chen wrote:
>>>> Hi Tim and Chenyu,
>>>>
>>>>
>>>> On 6/18/25 11:27, Tim Chen wrote:
>>>>> Cache-aware scheduling is designed to aggregate threads into their
>>>>> preferred LLC, either via the task wake up path or the load balancing
>>>>> path. One side effect is that when the preferred LLC is saturated,
>>>>> more threads will continue to be stacked on it, degrading the workload's
>>>>> latency. A strategy is needed to prevent this aggregation from going too
>>>>> far such that the preferred LLC is too overloaded.
>>>>>
>>>>> Introduce helper function _get_migrate_hint() to implement the LLC
>>>>> migration policy:
>>>>>
>>>>> 1) A task is aggregated to its preferred LLC if both source/dest LLC
>>>>>      are not too busy (<50% utilization, tunable), or the preferred
>>>>>      LLC will not be too out of balanced from the non preferred LLC
>>>>>      (>20% utilization, tunable, close to imbalance_pct of the LLC
>>>>>      domain).
>>>>> 2) Allow a task to be moved from the preferred LLC to the
>>>>>      non-preferred one if the non-preferred LLC will not be too out
>>>>>      of balanced from the preferred prompting an aggregation task
>>>>>      migration later.  We are still experimenting with the aggregation
>>>>>      and migration policy. Some other possibilities are policy based
>>>>>      on LLC's load or average number of tasks running.  Those could
>>>>>      be tried out by tweaking _get_migrate_hint().
>>>>>
>>>>> The function _get_migrate_hint() returns migration suggestions for the upper-le
>>>>> +__read_mostly unsigned int sysctl_llc_aggr_cap       = 50;
>>>>> +__read_mostly unsigned int sysctl_llc_aggr_imb       = 20;
>>>>> +
>>>>
>>>>
>>>> I think this patch has a great potential.
>>>>
>>>> Since _get_migrate_hint() is tied to an individual task anyway, why not add a
>>>> per-task llc_aggr_imb which defaults to the sysctl one? Tasks have different
>>>> preferences for llc stacking, they can all be running in the same system at the
>>>> same time. This way you can offer a greater deal of optimization without much
>>>> burden to others.
>>>
>>> Yes, this doable. It can be evaluated after the global generic strategy
>>> has been verified to work, like NUMA balancing :)
>>>
>>
>> I will run some real-world workloads and get back to you (may take some time)
>>
> 
> Thanks. It seems that there are pros and cons for different
> workloads and we are evaluating adding the RSS/active nr_running
> per process to deal with different type of workloads.
> 
>>>>
>>>> Also with sysctl_llc_aggr_imb, do we really need SCHED_CACHE_WAKE?
>>>
>>> Do you mean the SCHED_CACHE_WAKE or SCHED_CACHE_LB?
>>>
>>
>> Ah I was thinking sysctl_llc_aggr_imb alone can help reduce overstacking on
>> target LLC from a few hyperactive wakees (may consider to ratelimit those
>> wakees as a solution), but just realize this can affect lb as well and doesn't
>> really reduce overheads from frequent wakeups (no good idea on top of my head
>> but we should find a better solution than sched_feat to address the overhead issue).
>>
btw, just for correction: I meant wakers here, not wakees
>>
>>
>>>> Does setting sysctl_llc_aggr_imb to 0 basically say no preference for either LLC, no?
>>>>
>>>
>>> My understanding is that, if sysctl_llc_aggr_imb is 0, the task aggregation
>>> might still consider other aspects, like if that target LLC's utilization has
>>> exceeded 50% or not.
>>>
>>
>> which can be controlled by sysctl_llc_aggr_cap, right? Okay so if both LLCs have
>> <$(sysctl_llc_aggr_cap)% utilization, should sysctl_llc_aggr_cap be the only
>> determining factor here barring NUMA balancing?
>>
> 
> If both LLC are under (sysctl_llc_aggr_cap)%, then the strategy is still to allow
> task to be aggregated to its preferred LLC, by either asking the task to not be
> pulled out of its preferred LLC, or migrate task to its preferred LLC,
> in _get_migrate_hint().
> 
Ok, got it. It looks to me like sysctl_llc_aggr_imb and sysctl_llc_aggr_cap can
have quite an impact on perf. I will play around with different values a bit.

Libo


> Thanks,
> Chenyu