Cache aware load-balancing

[RFC PATCH v4 26/28] sched: Do not enable cache aware scheduling for process with large RSS

Posted by Chen Yu 1 month, 3 weeks ago

It has been reported that when running memory-intensive workloads
such as stream, sched_cache may saturate the memory bandwidth on
the preferred LLC.

To prevent this from happening, evaluate the process's memory
footprint by checking the size of RSS (anonymous pages and shared
pages) and comparing it to the size of the LLC. If the former is
larger, skip cache-aware scheduling. This is because if tasks
do not actually share data, aggregating tasks with large RSS will
likely result in cache contention and performance degradation.

However, in theory, RSS is not the same as memory footprint.
This is just an estimated approach to prevent over-aggregation.
The default behavior is to strictly compare the size of RSS with
the size of the LLC. The next patch will introduce a user-provided
hint to customize this comparison.
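
To make the comparison concrete, here is a minimal userspace sketch of the arithmetic. It is not the kernel code: the function name rss_exceeds_llc() and its parameters are hypothetical stand-ins, whereas the patch reads the counters with get_mm_counter() and scales them by PAGE_SIZE.

```c
#include <stdbool.h>

/*
 * Hypothetical userspace stand-in for the comparison in
 * exceed_llc_capacity(). RSS is approximated as anonymous pages
 * plus shmem pages, converted to bytes.
 */
bool rss_exceeds_llc(unsigned long anon_pages, unsigned long shmem_pages,
		     unsigned long page_size, unsigned long llc_bytes)
{
	unsigned long rss_bytes = (anon_pages + shmem_pages) * page_size;

	/* Strict comparison as in the patch: an RSS equal to the LLC size counts as exceeding. */
	return llc_bytes <= rss_bytes;
}
```

Assuming 4 KiB pages and a 32 MiB LLC, the threshold is 32 MiB / 4 KiB = 8192 pages: rss_exceeds_llc(8000, 100, 4096, 32UL << 20) is false (8100 pages), while rss_exceeds_llc(8000, 192, 4096, 32UL << 20) is true (exactly 8192 pages).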

Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 44 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4bf794f170cf..cbda7dad1305 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1205,6 +1205,34 @@ static inline int pref_llc_idx(struct task_struct *p)
 	return llc_idx(p->preferred_llc);
 }
 
+static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
+{
+	struct cpu_cacheinfo *this_cpu_ci;
+	struct cacheinfo *l3_leaf;
+	unsigned long rss;
+	unsigned int llc;
+
+	/*
+	 * get_cpu_cacheinfo_level() can not be used
+	 * because it requires the cpu_hotplug_lock
+	 * to be held. Use get_cpu_cacheinfo()
+	 * directly because the 'cpu' can not be
+	 * offlined at the moment.
+	 */
+	this_cpu_ci = get_cpu_cacheinfo(cpu);
+	if (!this_cpu_ci->info_list ||
+	    this_cpu_ci->num_leaves < 3)
+		return true;
+
+	l3_leaf = this_cpu_ci->info_list + 3;
+	llc = l3_leaf->size;
+
+	rss = get_mm_counter(mm, MM_ANONPAGES) +
+		get_mm_counter(mm, MM_SHMEMPAGES);
+
+	return (llc <= (rss * PAGE_SIZE));
+}
+
 static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
 {
 	int smt_nr = 1;
@@ -1363,7 +1391,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 */
 	if (epoch - READ_ONCE(mm->mm_sched_epoch) > sysctl_llc_old ||
 	    get_nr_threads(p) <= 1 ||
-	    exceed_llc_nr(mm, cpu_of(rq))) {
+	    exceed_llc_nr(mm, cpu_of(rq)) ||
+	    exceed_llc_capacity(mm, cpu_of(rq))) {
 		mm->mm_sched_cpu = -1;
 		pcpu_sched->occ = 0;
 	}
@@ -1448,6 +1477,14 @@ static void __no_profile task_cache_work(struct callback_head *work)
 		return;
 	}
 
+	/*
+	 * Do not check exceed_llc_nr() because
+	 * the active number of threads needs to
+	 * be updated anyway.
+	 */
+	if (exceed_llc_capacity(mm, curr_cpu))
+		return;
+
 	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
 		return;
 
@@ -9113,8 +9150,12 @@ static __maybe_unused enum llc_mig_hint get_migrate_hint(int src_cpu, int dst_cp
 	if (cpu < 0)
 		return mig_allow;
 
-	 /* skip cache aware load balance for single/too many threads */
-	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu))
+	/*
+	 * skip cache aware load balance for single/too many threads
+	 * and large footprint.
+	 */
+	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu) ||
+	    exceed_llc_capacity(mm, dst_cpu))
 		return mig_allow;
 
 	if (cpus_share_cache(dst_cpu, cpu))
-- 
2.25.1

Re: [RFC PATCH v4 26/28] sched: Do not enable cache aware scheduling for process with large RSS

Posted by Adam Li 1 week, 1 day ago

Hi Chen Yu,

Thanks for your work.
I tested the patch set on AmpereOne CPU with 192 cores.

With CONFIG_SCHED_CLUSTER enabled and a certain firmware setting,
every eight cores are grouped into a 'cluster' scheduling domain
with the 'SD_SHARE_LLC' flag.
However, these eight cores do *not* share an L3 cache in this setup.

In exceed_llc_capacity() of this patch, we have 'llc = l3_leaf->size', and
'llc' will be zero if there is *no* L3 cache.
So exceed_llc_capacity() will return true and 'Cache Aware Scheduling' will
not work. Please see the details below.

I read in patch 01/28 "sched: Cache aware load-balancing" [1],
Peter mentioned:
"It is an attempt at modelling cache affinity -- and while the patch
really only targets LLC, it could very well be extended to also apply to
clusters (L2). Specifically any case of multiple cache domains inside a
node".

Do you have any idea how we can apply the cache aware load-balancing
to clusters? The cores in the cluster may share L2 or LLC tags.

[1]: https://lore.kernel.org/all/9157186cf9e3fd541f62c637579ff736b3704c51.1754712565.git.tim.c.chen@linux.intel.com/

On 8/9/2025 1:08 PM, Chen Yu wrote:
> It has been reported that when running memory-intensive workloads
> such as stream, sched_cache may saturate the memory bandwidth on
> the preferred LLC.
> 
> To prevent this from happening, evaluate the process's memory
> footprint by checking the size of RSS (anonymous pages and shared
> pages) and comparing it to the size of the LLC. If the former is
> larger, skip cache-aware scheduling. This is because if tasks
> do not actually share data, aggregating tasks with large RSS will
> likely result in cache contention and performance degradation.
> 
> However, in theory, RSS is not the same as memory footprint.
> This is just an estimated approach to prevent over-aggregation.
> The default behavior is to strictly compare the size of RSS with
> the size of the LLC. The next patch will introduce a user-provided
> hint to customize this comparison.
> 
> Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>  kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 44 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4bf794f170cf..cbda7dad1305 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1205,6 +1205,34 @@ static inline int pref_llc_idx(struct task_struct *p)
>  	return llc_idx(p->preferred_llc);
>  }
>  
> +static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
> +{
> +	struct cpu_cacheinfo *this_cpu_ci;
> +	struct cacheinfo *l3_leaf;
> +	unsigned long rss;
> +	unsigned int llc;
> +
> +	/*
> +	 * get_cpu_cacheinfo_level() can not be used
> +	 * because it requires the cpu_hotplug_lock
> +	 * to be held. Use get_cpu_cacheinfo()
> +	 * directly because the 'cpu' can not be
> +	 * offlined at the moment.
> +	 */
> +	this_cpu_ci = get_cpu_cacheinfo(cpu);
> +	if (!this_cpu_ci->info_list ||
> +	    this_cpu_ci->num_leaves < 3)
> +		return true;
> +
> +	l3_leaf = this_cpu_ci->info_list + 3;
> +	llc = l3_leaf->size;
> +
On some arm64 CPU topologies, cores can be grouped into a 'cluster'.
Cores in a cluster may not share an L3 cache, in which case
'l3_leaf->size' will be 0.

It looks like we assume the LLC is the L3 cache?

Can we skip the exceed_llc_capacity() check if there is no L3?
Like this draft patch:

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1227,6 +1227,8 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)

        l3_leaf = this_cpu_ci->info_list + 3;
        llc = l3_leaf->size;
+       if (!llc)
+               return false;

        rss = get_mm_counter(mm, MM_ANONPAGES) +
                get_mm_counter(mm, MM_SHMEMPAGES);


> +	rss = get_mm_counter(mm, MM_ANONPAGES) +
> +		get_mm_counter(mm, MM_SHMEMPAGES);
> +
> +	return (llc <= (rss * PAGE_SIZE));

If 'llc' is 0, exceed_llc_capacity() will always return true.
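
A small userspace sketch of why this happens and of what the draft guard changes. The two helper names are hypothetical; only the comparison itself and the zero-size guard are taken from the patch and the draft above.

```c
#include <stdbool.h>

/* Comparison as posted: a reported LLC size of 0 is <= any RSS, even 0. */
bool exceeds_as_posted(unsigned int llc_bytes, unsigned long rss_bytes)
{
	return llc_bytes <= rss_bytes;
}

/* With the draft guard: a zero (unreported) LLC size disables the check. */
bool exceeds_with_guard(unsigned int llc_bytes, unsigned long rss_bytes)
{
	if (!llc_bytes)
		return false;

	return llc_bytes <= rss_bytes;
}
```

With llc_bytes == 0, exceeds_as_posted() returns true for every RSS value, so cache-aware scheduling is never applied, while exceeds_with_guard() returns false and leaves the decision to the other checks.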

> +}
> +
>  static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
>  {
>  	int smt_nr = 1;
> @@ -1363,7 +1391,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>  	 */
>  	if (epoch - READ_ONCE(mm->mm_sched_epoch) > sysctl_llc_old ||
>  	    get_nr_threads(p) <= 1 ||
> -	    exceed_llc_nr(mm, cpu_of(rq))) {
> +	    exceed_llc_nr(mm, cpu_of(rq)) ||
> +	    exceed_llc_capacity(mm, cpu_of(rq))) {
>  		mm->mm_sched_cpu = -1;
>  		pcpu_sched->occ = 0;
>  	}
> @@ -1448,6 +1477,14 @@ static void __no_profile task_cache_work(struct callback_head *work)
>  		return;
>  	}
>  
> +	/*
> +	 * Do not check exceed_llc_nr() because
> +	 * the active number of threads needs to
> +	 * be updated anyway.
> +	 */
> +	if (exceed_llc_capacity(mm, curr_cpu))
> +		return;
> +
>  	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
>  		return;
>  
> @@ -9113,8 +9150,12 @@ static __maybe_unused enum llc_mig_hint get_migrate_hint(int src_cpu, int dst_cp
>  	if (cpu < 0)
>  		return mig_allow;
>  
> -	 /* skip cache aware load balance for single/too many threads */
> -	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu))
> +	/*
> +	 * skip cache aware load balance for single/too many threads
> +	 * and large footprint.
> +	 */
> +	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu) ||
> +	    exceed_llc_capacity(mm, dst_cpu))
>  		return mig_allow;
>  
>  	if (cpus_share_cache(dst_cpu, cpu))

Thanks,
-adam

Re: [RFC PATCH v4 26/28] sched: Do not enable cache aware scheduling for process with large RSS

Posted by Chen, Yu C 1 week, 1 day ago

On 9/26/2025 4:48 PM, Adam Li wrote:
> Hi Chen Yu,
> 
> Thanks for your work.
> I tested the patch set on AmpereOne CPU with 192 cores.
> 
> With CONFIG_SCHED_CLUSTER enabled and a certain firmware setting,
> every eight cores are grouped into a 'cluster' scheduling domain
> with the 'SD_SHARE_LLC' flag.
> However, these eight cores do *not* share an L3 cache in this setup.
> 
> In exceed_llc_capacity() of this patch, we have 'llc = l3_leaf->size', and
> 'llc' will be zero if there is *no* L3 cache.
> So exceed_llc_capacity() will return true and 'Cache Aware Scheduling' will
> not work. Please see the details below.
> 
> I read in patch 01/28 "sched: Cache aware load-balancing" [1],
> Peter mentioned:
> "It is an attempt at modelling cache affinity -- and while the patch
> really only targets LLC, it could very well be extended to also apply to
> clusters (L2). Specifically any case of multiple cache domains inside a
> node".
> 
> Do you have any idea how we can apply the cache aware load-balancing
> to clusters? The cores in the cluster may share L2 or LLC tags.

My understanding is that if there is no L3 cache, then the L2 becomes
the LLC. We don’t need code specific to L2-aware scheduling, because
the L2 is then the last-level cache (LLC). However, as you observed,
some cases still need to be taken care of. For example, Patch 8
needs to be fixed so that it does not always retrieve the cache size of
L3.

On the other hand, if the system has both an L2 cluster and an L3, the
code might need to be changed if we want to perform L2 cache aggregation
rather than L3 cache aggregation.

> 
> [1]: https://lore.kernel.org/all/9157186cf9e3fd541f62c637579ff736b3704c51.1754712565.git.tim.c.chen@linux.intel.com/
> 
> On 8/9/2025 1:08 PM, Chen Yu wrote:
>> +
>> +	l3_leaf = this_cpu_ci->info_list + 3;
>> +	llc = l3_leaf->size;
>> +
> On some arm64 CPU topologies, cores can be grouped into a 'cluster'.
> Cores in a cluster may not share an L3 cache, in which case
> 'l3_leaf->size' will be 0.
> 
> It looks like we assume the LLC is the L3 cache?

Right, but the LLC is not always the L3, so this needs a fix.

> 
> Can we skip the exceed_llc_capacity() check if there is no L3?

I thought we should return the size of L2 instead, no?
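
One way to read this suggestion, as a rough userspace sketch rather than kernel code: instead of indexing leaf 3 directly, walk the cache leaves and take the size of the highest-level one that reports a non-zero size, so that a system without an L3 falls back to L2. Both struct cache_leaf and llc_size_bytes() are hypothetical stand-ins, not the kernel's struct cacheinfo API, and the sketch ignores details such as cache type (instruction, data, unified).

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for one cache leaf. */
struct cache_leaf {
	unsigned int level;	/* 1 = L1, 2 = L2, 3 = L3 */
	unsigned int size;	/* bytes; 0 if absent or not reported */
};

/*
 * Return the size of the highest-level leaf that reports a non-zero
 * size, or 0 if no leaf reports one.
 */
unsigned int llc_size_bytes(const struct cache_leaf *leaves, size_t nr)
{
	unsigned int best_level = 0, best_size = 0;

	for (size_t i = 0; i < nr; i++) {
		if (leaves[i].size && leaves[i].level > best_level) {
			best_level = leaves[i].level;
			best_size = leaves[i].size;
		}
	}

	return best_size;
}
```

A caller would still have to decide what to do when this returns 0; the draft patch in this thread skips the capacity check in that case.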

thanks,
Chenyu
> Like this draft patch:
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1227,6 +1227,8 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
> 
>          l3_leaf = this_cpu_ci->info_list + 3;
>          llc = l3_leaf->size;
> +       if (!llc)
> +               return false;