[v2] Cache aware scheduling

[PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes

Posted by Tim Chen 2 months ago

From: Chen Yu <yu.c.chen@intel.com>

Prateek and Tingyin reported that memory-intensive workloads (such as
stream) can saturate memory bandwidth and caches on the preferred LLC
when sched_cache aggregates too many threads.

To mitigate this, estimate a process's memory footprint by comparing
its RSS (anonymous and shared pages) to the size of the LLC. If RSS
exceeds the LLC size, skip cache-aware scheduling.

Note that RSS is only an approximation of the memory footprint.
By default, the comparison is strict, but a later patch will allow
users to provide a hint to adjust this threshold.

According to the test from Adam, some systems do not have shared L3
but with shared L2 as clusters. In this case, the L2 becomes the LLC[1].

Link[1]: https://lore.kernel.org/all/3cb6ebc7-a2fd-42b3-8739-b00e28a09cb6@os.amperecomputing.com/

Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2: Assigned curr_cpu in task_cache_work() before checking
            exceed_llc_capacity(mm, curr_cpu) to avoid out-of-bound
            access.(lkp/0day)

 include/linux/cacheinfo.h | 21 ++++++++++-------
 kernel/sched/fair.c       | 49 +++++++++++++++++++++++++++++++++++----
 2 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index c8f4f0a0b874..82d0d59ca0e1 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -113,18 +113,11 @@ int acpi_get_cache_info(unsigned int cpu,
 
 const struct attribute_group *cache_get_priv_group(struct cacheinfo *this_leaf);
 
-/*
- * Get the cacheinfo structure for the cache associated with @cpu at
- * level @level.
- * cpuhp lock must be held.
- */
-static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
+static inline struct cacheinfo *_get_cpu_cacheinfo_level(int cpu, int level)
 {
 	struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
 	int i;
 
-	lockdep_assert_cpus_held();
-
 	for (i = 0; i < ci->num_leaves; i++) {
 		if (ci->info_list[i].level == level) {
 			if (ci->info_list[i].attributes & CACHE_ID)
@@ -136,6 +129,18 @@ static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
 	return NULL;
 }
 
+/*
+ * Get the cacheinfo structure for the cache associated with @cpu at
+ * level @level.
+ * cpuhp lock must be held.
+ */
+static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
+{
+	lockdep_assert_cpus_held();
+
+	return _get_cpu_cacheinfo_level(cpu, level);
+}
+
 /*
  * Get the id of the cache associated with @cpu at level @level.
  * cpuhp lock must be held.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6afa3f9a4e9b..424ec601cfdf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1223,6 +1223,38 @@ static int llc_id(int cpu)
 	return llc;
 }
 
+static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
+{
+	struct cacheinfo *ci;
+	unsigned long rss;
+	unsigned int llc;
+
+	/*
+	 * get_cpu_cacheinfo_level() can not be used
+	 * because it requires the cpu_hotplug_lock
+	 * to be held. Use _get_cpu_cacheinfo_level()
+	 * directly because the 'cpu' can not be
+	 * offlined at the moment.
+	 */
+	ci = _get_cpu_cacheinfo_level(cpu, 3);
+	if (!ci) {
+		/*
+		 * On system without L3 but with shared L2,
+		 * L2 becomes the LLC.
+		 */
+		ci = _get_cpu_cacheinfo_level(cpu, 2);
+		if (!ci)
+			return true;
+	}
+
+	llc = ci->size;
+
+	rss = get_mm_counter(mm, MM_ANONPAGES) +
+		get_mm_counter(mm, MM_SHMEMPAGES);
+
+	return (llc <= (rss * PAGE_SIZE));
+}
+
 static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
 {
 	int smt_nr = 1;
@@ -1382,7 +1414,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 */
 	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
 	    get_nr_threads(p) <= 1 ||
-	    exceed_llc_nr(mm, cpu_of(rq))) {
+	    exceed_llc_nr(mm, cpu_of(rq)) ||
+	    exceed_llc_capacity(mm, cpu_of(rq))) {
 		if (mm->mm_sched_cpu != -1)
 			mm->mm_sched_cpu = -1;
 	}
@@ -1439,7 +1472,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
 	struct mm_struct *mm = p->mm;
 	unsigned long m_a_occ = 0;
 	unsigned long curr_m_a_occ = 0;
-	int cpu, m_a_cpu = -1, nr_running = 0;
+	int cpu, m_a_cpu = -1, nr_running = 0, curr_cpu;
 	cpumask_var_t cpus;
 
 	WARN_ON_ONCE(work != &p->cache_work);
@@ -1449,7 +1482,9 @@ static void __no_profile task_cache_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
-	if (get_nr_threads(p) <= 1) {
+	curr_cpu = task_cpu(p);
+	if (get_nr_threads(p) <= 1 ||
+	    exceed_llc_capacity(mm, curr_cpu)) {
 		if (mm->mm_sched_cpu != -1)
 			mm->mm_sched_cpu = -1;
 
@@ -9895,8 +9930,12 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
 	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
 		return mig_unrestricted;
 
-	/* skip cache aware load balance for single/too many threads */
-	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu))
+	/*
+	 * Skip cache aware load balance for single/too many threads
+	 * or large footprint.
+	 */
+	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu) ||
+	    exceed_llc_capacity(mm, dst_cpu))
 		return mig_unrestricted;
 
 	if (cpus_share_cache(dst_cpu, cpu))
-- 
2.32.0

Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes

Posted by Vern Hao 1 month, 3 weeks ago

On 2025/12/4 07:07, Tim Chen wrote:
> From: Chen Yu <yu.c.chen@intel.com>
>
> Prateek and Tingyin reported that memory-intensive workloads (such as
> stream) can saturate memory bandwidth and caches on the preferred LLC
> when sched_cache aggregates too many threads.
>
> To mitigate this, estimate a process's memory footprint by comparing
> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
> exceeds the LLC size, skip cache-aware scheduling.
Restricting RSS prevents many applications from benefiting from this 
optimization. I believe this restriction should be lifted. For 
memory-intensive workloads, the optimization may simply yield no gains, 
but it certainly shouldn't make performance worse. We need to further 
refine this logic.
> Note that RSS is only an approximation of the memory footprint.
> By default, the comparison is strict, but a later patch will allow
> users to provide a hint to adjust this threshold.
>
> According to the test from Adam, some systems do not have shared L3
> but with shared L2 as clusters. In this case, the L2 becomes the LLC[1].
>
> Link[1]: https://lore.kernel.org/all/3cb6ebc7-a2fd-42b3-8739-b00e28a09cb6@os.amperecomputing.com/
>
> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
>
> Notes:
>      v1->v2: Assigned curr_cpu in task_cache_work() before checking
>              exceed_llc_capacity(mm, curr_cpu) to avoid out-of-bound
>              access.(lkp/0day)
>
>   include/linux/cacheinfo.h | 21 ++++++++++-------
>   kernel/sched/fair.c       | 49 +++++++++++++++++++++++++++++++++++----
>   2 files changed, 57 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
> index c8f4f0a0b874..82d0d59ca0e1 100644
> --- a/include/linux/cacheinfo.h
> +++ b/include/linux/cacheinfo.h
> @@ -113,18 +113,11 @@ int acpi_get_cache_info(unsigned int cpu,
>   
>   const struct attribute_group *cache_get_priv_group(struct cacheinfo *this_leaf);
>   
> -/*
> - * Get the cacheinfo structure for the cache associated with @cpu at
> - * level @level.
> - * cpuhp lock must be held.
> - */
> -static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
> +static inline struct cacheinfo *_get_cpu_cacheinfo_level(int cpu, int level)
>   {
>   	struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
>   	int i;
>   
> -	lockdep_assert_cpus_held();
> -
>   	for (i = 0; i < ci->num_leaves; i++) {
>   		if (ci->info_list[i].level == level) {
>   			if (ci->info_list[i].attributes & CACHE_ID)
> @@ -136,6 +129,18 @@ static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
>   	return NULL;
>   }
>   
> +/*
> + * Get the cacheinfo structure for the cache associated with @cpu at
> + * level @level.
> + * cpuhp lock must be held.
> + */
> +static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
> +{
> +	lockdep_assert_cpus_held();
> +
> +	return _get_cpu_cacheinfo_level(cpu, level);
> +}
> +
>   /*
>    * Get the id of the cache associated with @cpu at level @level.
>    * cpuhp lock must be held.
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6afa3f9a4e9b..424ec601cfdf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1223,6 +1223,38 @@ static int llc_id(int cpu)
>   	return llc;
>   }
>   
> +static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
> +{
> +	struct cacheinfo *ci;
> +	unsigned long rss;
> +	unsigned int llc;
> +
> +	/*
> +	 * get_cpu_cacheinfo_level() can not be used
> +	 * because it requires the cpu_hotplug_lock
> +	 * to be held. Use _get_cpu_cacheinfo_level()
> +	 * directly because the 'cpu' can not be
> +	 * offlined at the moment.
> +	 */
> +	ci = _get_cpu_cacheinfo_level(cpu, 3);
> +	if (!ci) {
> +		/*
> +		 * On system without L3 but with shared L2,
> +		 * L2 becomes the LLC.
> +		 */
> +		ci = _get_cpu_cacheinfo_level(cpu, 2);
> +		if (!ci)
> +			return true;
> +	}
Is there must call it one by one for get llc size? a static variable 
instead in building sched domain?
> +
> +	llc = ci->size;
> +
> +	rss = get_mm_counter(mm, MM_ANONPAGES) +
> +		get_mm_counter(mm, MM_SHMEMPAGES);
> +
> +	return (llc <= (rss * PAGE_SIZE));
> +}
> +
>   static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
>   {
>   	int smt_nr = 1;
> @@ -1382,7 +1414,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
>   	 */
>   	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
>   	    get_nr_threads(p) <= 1 ||
> -	    exceed_llc_nr(mm, cpu_of(rq))) {
> +	    exceed_llc_nr(mm, cpu_of(rq)) ||
> +	    exceed_llc_capacity(mm, cpu_of(rq))) {
>   		if (mm->mm_sched_cpu != -1)
>   			mm->mm_sched_cpu = -1;
>   	}
> @@ -1439,7 +1472,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
>   	struct mm_struct *mm = p->mm;
>   	unsigned long m_a_occ = 0;
>   	unsigned long curr_m_a_occ = 0;
> -	int cpu, m_a_cpu = -1, nr_running = 0;
> +	int cpu, m_a_cpu = -1, nr_running = 0, curr_cpu;
>   	cpumask_var_t cpus;
>   
>   	WARN_ON_ONCE(work != &p->cache_work);
> @@ -1449,7 +1482,9 @@ static void __no_profile task_cache_work(struct callback_head *work)
>   	if (p->flags & PF_EXITING)
>   		return;
>   
> -	if (get_nr_threads(p) <= 1) {
> +	curr_cpu = task_cpu(p);
> +	if (get_nr_threads(p) <= 1 ||
> +	    exceed_llc_capacity(mm, curr_cpu)) {
>   		if (mm->mm_sched_cpu != -1)
>   			mm->mm_sched_cpu = -1;
>   
> @@ -9895,8 +9930,12 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
>   	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
>   		return mig_unrestricted;
>   
> -	/* skip cache aware load balance for single/too many threads */
> -	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu))
> +	/*
> +	 * Skip cache aware load balance for single/too many threads
> +	 * or large footprint.
> +	 */
> +	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu) ||
> +	    exceed_llc_capacity(mm, dst_cpu))
>   		return mig_unrestricted;
>   
>   	if (cpus_share_cache(dst_cpu, cpu))

Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes

Posted by Chen, Yu C 1 month, 3 weeks ago

On 12/18/2025 11:59 AM, Vern Hao wrote:
> 
> On 2025/12/4 07:07, Tim Chen wrote:
>> From: Chen Yu <yu.c.chen@intel.com>
>>
>> Prateek and Tingyin reported that memory-intensive workloads (such as
>> stream) can saturate memory bandwidth and caches on the preferred LLC
>> when sched_cache aggregates too many threads.
>>
>> To mitigate this, estimate a process's memory footprint by comparing
>> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
>> exceeds the LLC size, skip cache-aware scheduling.
> Restricting RSS prevents many applications from benefiting from this 
> optimization. I believe this restriction should be lifted. 
> For memory- 
> intensive workloads, the optimization may simply yield no gains, but it 
> certainly shouldn't make performance worse. We need to further refine 
> this logic.

Memory-intensive workloads may trigger performance regressions when
memory bandwidth(from L3 cache to memory controller) is saturated due
to task aggregation on single LLC. We have seen this issue in stream
benchmark runs in previous version.

Patch 23 introduces a debugfs knob llc_aggr_tolerance that lets userspace
tune the scale factor. This allows memory-intensive workloads to perform
task aggregation when their footprint is small and the administrator 
considers
it safe. As you noted in another patch, fine-grained control would improve
flexibility—and this can be addressed in future iterations.

>> Note that RSS is only an approximation of the memory footprint.
>> By default, the comparison is strict, but a later patch will allow
>> users to provide a hint to adjust this threshold.
>>
>> According to the test from Adam, some systems do not have shared L3
>> but with shared L2 as clusters. In this case, the L2 becomes the LLC[1].
>>
>> Link[1]: https://lore.kernel.org/all/3cb6ebc7-a2fd-42b3-8739- 
>> b00e28a09cb6@os.amperecomputing.com/
>>
>> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> ---
>>
>> Notes:
>>      v1->v2: Assigned curr_cpu in task_cache_work() before checking
>>              exceed_llc_capacity(mm, curr_cpu) to avoid out-of-bound
>>              access.(lkp/0day)
>>
>>   include/linux/cacheinfo.h | 21 ++++++++++-------
>>   kernel/sched/fair.c       | 49 +++++++++++++++++++++++++++++++++++----
>>   2 files changed, 57 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
>> index c8f4f0a0b874..82d0d59ca0e1 100644
>> --- a/include/linux/cacheinfo.h
>> +++ b/include/linux/cacheinfo.h
>> @@ -113,18 +113,11 @@ int acpi_get_cache_info(unsigned int cpu,
>>   const struct attribute_group *cache_get_priv_group(struct cacheinfo 
>> *this_leaf);
>> -/*
>> - * Get the cacheinfo structure for the cache associated with @cpu at
>> - * level @level.
>> - * cpuhp lock must be held.
>> - */
>> -static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int 
>> level)
>> +static inline struct cacheinfo *_get_cpu_cacheinfo_level(int cpu, int 
>> level)
>>   {
>>       struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
>>       int i;
>> -    lockdep_assert_cpus_held();
>> -
>>       for (i = 0; i < ci->num_leaves; i++) {
>>           if (ci->info_list[i].level == level) {
>>               if (ci->info_list[i].attributes & CACHE_ID)
>> @@ -136,6 +129,18 @@ static inline struct cacheinfo 
>> *get_cpu_cacheinfo_level(int cpu, int level)
>>       return NULL;
>>   }
>> +/*
>> + * Get the cacheinfo structure for the cache associated with @cpu at
>> + * level @level.
>> + * cpuhp lock must be held.
>> + */
>> +static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int 
>> level)
>> +{
>> +    lockdep_assert_cpus_held();
>> +
>> +    return _get_cpu_cacheinfo_level(cpu, level);
>> +}
>> +
>>   /*
>>    * Get the id of the cache associated with @cpu at level @level.
>>    * cpuhp lock must be held.
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 6afa3f9a4e9b..424ec601cfdf 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1223,6 +1223,38 @@ static int llc_id(int cpu)
>>       return llc;
>>   }
>> +static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
>> +{
>> +    struct cacheinfo *ci;
>> +    unsigned long rss;
>> +    unsigned int llc;
>> +
>> +    /*
>> +     * get_cpu_cacheinfo_level() can not be used
>> +     * because it requires the cpu_hotplug_lock
>> +     * to be held. Use _get_cpu_cacheinfo_level()
>> +     * directly because the 'cpu' can not be
>> +     * offlined at the moment.
>> +     */
>> +    ci = _get_cpu_cacheinfo_level(cpu, 3);
>> +    if (!ci) {
>> +        /*
>> +         * On system without L3 but with shared L2,
>> +         * L2 becomes the LLC.
>> +         */
>> +        ci = _get_cpu_cacheinfo_level(cpu, 2);
>> +        if (!ci)
>> +            return true;
>> +    }
> Is there must call it one by one for get llc size? a static variable 
> instead in building sched domain?

I suppose you suggested introducing a per-CPU variable, like 
percpu(sd_llc_bytes, cpu),
or something similar to struct cpuinfo_x86.x86_cache_size. I am not sure 
if the community
would endorse introducing this variable, given that sched_cache would be 
its only user.
We can leave this as an open question.

thanks,
Chenyu

Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes

Posted by Vern Hao 1 month, 3 weeks ago

On 2025/12/18 16:32, Chen, Yu C wrote:
> On 12/18/2025 11:59 AM, Vern Hao wrote:
>>
>> On 2025/12/4 07:07, Tim Chen wrote:
>>> From: Chen Yu <yu.c.chen@intel.com>
>>>
>>> Prateek and Tingyin reported that memory-intensive workloads (such as
>>> stream) can saturate memory bandwidth and caches on the preferred LLC
>>> when sched_cache aggregates too many threads.
>>>
>>> To mitigate this, estimate a process's memory footprint by comparing
>>> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
>>> exceeds the LLC size, skip cache-aware scheduling.
>> Restricting RSS prevents many applications from benefiting from this 
>> optimization. I believe this restriction should be lifted. For 
>> memory- intensive workloads, the optimization may simply yield no 
>> gains, but it certainly shouldn't make performance worse. We need to 
>> further refine this logic.
>
> Memory-intensive workloads may trigger performance regressions when
> memory bandwidth(from L3 cache to memory controller) is saturated due
RSS size and bandwidth saturation are not necessarily linked, In my 
view, the optimization should be robust enough that it doesn't cause a 
noticeable drop in performance, no matter how large the RSS is. We need 
to have a more profound discussion on this.
> to task aggregation on single LLC. We have seen this issue in stream
> benchmark runs in previous version.
>
> Patch 23 introduces a debugfs knob llc_aggr_tolerance that lets userspace
> tune the scale factor. This allows memory-intensive workloads to perform
> task aggregation when their footprint is small and the administrator 
> considers
> it safe. As you noted in another patch, fine-grained control would 
> improve
> flexibility—and this can be addressed in future iterations.
>
>>> Note that RSS is only an approximation of the memory footprint.
>>> By default, the comparison is strict, but a later patch will allow
>>> users to provide a hint to adjust this threshold.
>>>
>>> According to the test from Adam, some systems do not have shared L3
>>> but with shared L2 as clusters. In this case, the L2 becomes the 
>>> LLC[1].
>>>
>>> Link[1]: https://lore.kernel.org/all/3cb6ebc7-a2fd-42b3-8739- 
>>> b00e28a09cb6@os.amperecomputing.com/
>>>
>>> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> ---
>>>
>>> Notes:
>>>      v1->v2: Assigned curr_cpu in task_cache_work() before checking
>>>              exceed_llc_capacity(mm, curr_cpu) to avoid out-of-bound
>>>              access.(lkp/0day)
>>>
>>>   include/linux/cacheinfo.h | 21 ++++++++++-------
>>>   kernel/sched/fair.c       | 49 
>>> +++++++++++++++++++++++++++++++++++----
>>>   2 files changed, 57 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
>>> index c8f4f0a0b874..82d0d59ca0e1 100644
>>> --- a/include/linux/cacheinfo.h
>>> +++ b/include/linux/cacheinfo.h
>>> @@ -113,18 +113,11 @@ int acpi_get_cache_info(unsigned int cpu,
>>>   const struct attribute_group *cache_get_priv_group(struct 
>>> cacheinfo *this_leaf);
>>> -/*
>>> - * Get the cacheinfo structure for the cache associated with @cpu at
>>> - * level @level.
>>> - * cpuhp lock must be held.
>>> - */
>>> -static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, 
>>> int level)
>>> +static inline struct cacheinfo *_get_cpu_cacheinfo_level(int cpu, 
>>> int level)
>>>   {
>>>       struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
>>>       int i;
>>> -    lockdep_assert_cpus_held();
>>> -
>>>       for (i = 0; i < ci->num_leaves; i++) {
>>>           if (ci->info_list[i].level == level) {
>>>               if (ci->info_list[i].attributes & CACHE_ID)
>>> @@ -136,6 +129,18 @@ static inline struct cacheinfo 
>>> *get_cpu_cacheinfo_level(int cpu, int level)
>>>       return NULL;
>>>   }
>>> +/*
>>> + * Get the cacheinfo structure for the cache associated with @cpu at
>>> + * level @level.
>>> + * cpuhp lock must be held.
>>> + */
>>> +static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, 
>>> int level)
>>> +{
>>> +    lockdep_assert_cpus_held();
>>> +
>>> +    return _get_cpu_cacheinfo_level(cpu, level);
>>> +}
>>> +
>>>   /*
>>>    * Get the id of the cache associated with @cpu at level @level.
>>>    * cpuhp lock must be held.
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 6afa3f9a4e9b..424ec601cfdf 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -1223,6 +1223,38 @@ static int llc_id(int cpu)
>>>       return llc;
>>>   }
>>> +static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
>>> +{
>>> +    struct cacheinfo *ci;
>>> +    unsigned long rss;
>>> +    unsigned int llc;
>>> +
>>> +    /*
>>> +     * get_cpu_cacheinfo_level() can not be used
>>> +     * because it requires the cpu_hotplug_lock
>>> +     * to be held. Use _get_cpu_cacheinfo_level()
>>> +     * directly because the 'cpu' can not be
>>> +     * offlined at the moment.
>>> +     */
>>> +    ci = _get_cpu_cacheinfo_level(cpu, 3);
>>> +    if (!ci) {
>>> +        /*
>>> +         * On system without L3 but with shared L2,
>>> +         * L2 becomes the LLC.
>>> +         */
>>> +        ci = _get_cpu_cacheinfo_level(cpu, 2);
>>> +        if (!ci)
>>> +            return true;
>>> +    }
>> Is there must call it one by one for get llc size? a static variable 
>> instead in building sched domain?
>
> I suppose you suggested introducing a per-CPU variable, like 
> percpu(sd_llc_bytes, cpu),
> or something similar to struct cpuinfo_x86.x86_cache_size. I am not 
> sure if the community
> would endorse introducing this variable, given that sched_cache would 
> be its only user.
> We can leave this as an open question.
>
> thanks,
> Chenyu

Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes

Posted by K Prateek Nayak 1 month, 2 weeks ago

Hello Vern,

On 12/18/2025 3:12 PM, Vern Hao wrote:
> 
> On 2025/12/18 16:32, Chen, Yu C wrote:
>> On 12/18/2025 11:59 AM, Vern Hao wrote:
>>>
>>> On 2025/12/4 07:07, Tim Chen wrote:
>>>> From: Chen Yu <yu.c.chen@intel.com>
>>>>
>>>> Prateek and Tingyin reported that memory-intensive workloads (such as
>>>> stream) can saturate memory bandwidth and caches on the preferred LLC
>>>> when sched_cache aggregates too many threads.
>>>>
>>>> To mitigate this, estimate a process's memory footprint by comparing
>>>> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
>>>> exceeds the LLC size, skip cache-aware scheduling.
>>> Restricting RSS prevents many applications from benefiting from this optimization. I believe this restriction should be lifted. For memory- intensive workloads, the optimization may simply yield no gains, but it certainly shouldn't make performance worse. We need to further refine this logic.
>>
>> Memory-intensive workloads may trigger performance regressions when
>> memory bandwidth(from L3 cache to memory controller) is saturated due
> RSS size and bandwidth saturation are not necessarily linked, In my view, the optimization should be robust enough that it doesn't cause a noticeable drop in performance, no matter how large the RSS is.

Easier said than done. I agree RSS size is not a clear indication of
bandwidth saturation. With NUMA Balancing enabled, we can use the
hinting faults to estimate the working set and make decisions but for
systems that do not have NUMA, short of programming some performance
counters, there is no real way to estimate the working set.

Hinting faults are known to cause overheads so enabling them without
NUMA can cause noticeable overheads with no real benefits.

> We need to have a more profound discussion on this.

What do you have in mind?

From where I stand, having the RSS based bailout for now won't make
things worse for these tasks with huge memory reserves and when we can
all agree on some generic method to estimate the working set of a task,
we can always add it into exceed_llc_capacity().

-- 
Thanks and Regards,
Prateek

"Rome wasn't built in a day but they were laying bricks every hour.
 You don't have to build everything you want today, just lay a brick."

  - James Clear

Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes

Posted by Vern Hao 1 month, 2 weeks ago

On 2025/12/19 11:14, K Prateek Nayak wrote:
> Hello Vern,
>
> On 12/18/2025 3:12 PM, Vern Hao wrote:
>> On 2025/12/18 16:32, Chen, Yu C wrote:
>>> On 12/18/2025 11:59 AM, Vern Hao wrote:
>>>> On 2025/12/4 07:07, Tim Chen wrote:
>>>>> From: Chen Yu <yu.c.chen@intel.com>
>>>>>
>>>>> Prateek and Tingyin reported that memory-intensive workloads (such as
>>>>> stream) can saturate memory bandwidth and caches on the preferred LLC
>>>>> when sched_cache aggregates too many threads.
>>>>>
>>>>> To mitigate this, estimate a process's memory footprint by comparing
>>>>> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
>>>>> exceeds the LLC size, skip cache-aware scheduling.
>>>> Restricting RSS prevents many applications from benefiting from this optimization. I believe this restriction should be lifted. For memory- intensive workloads, the optimization may simply yield no gains, but it certainly shouldn't make performance worse. We need to further refine this logic.
>>> Memory-intensive workloads may trigger performance regressions when
>>> memory bandwidth(from L3 cache to memory controller) is saturated due
>> RSS size and bandwidth saturation are not necessarily linked, In my view, the optimization should be robust enough that it doesn't cause a noticeable drop in performance, no matter how large the RSS is.
> Easier said than done. I agree RSS size is not a clear indication of
> bandwidth saturation. With NUMA Balancing enabled, we can use the
> hinting faults to estimate the working set and make decisions but for
> systems that do not have NUMA, short of programming some performance
> counters, there is no real way to estimate the working set.
I see the challenge, but the reality is that many production workloads 
have large memory footprints and deserve to see performance gains as 
well. In my testing with Chen Yu on STREAM, it's intriguing that the 
performance is fine without |llc_enable| but drops significantly once 
it's turned on.I sincerely hope this situation can be optimized; 
otherwise, we won't be able to utilize these optimizations in 
large-memory scenarios.
>
> Hinting faults are known to cause overheads so enabling them without
> NUMA can cause noticeable overheads with no real benefits.
>
>> We need to have a more profound discussion on this.
> What do you have in mind?
I am wondering if we could address this through alternative approaches, 
such as reducing the migration frequency or preventing excessive task 
stacking within a single LLC. Of course, defining the right metrics to 
evaluate these conditions remains a significant challenge.
>
>  From where I stand, having the RSS based bailout for now won't make
> things worse for these tasks with huge memory reserves and when we can
> all agree on some generic method to estimate the working set of a task,
> we can always add it into exceed_llc_capacity().
>

Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes

Posted by Chen, Yu C 1 month, 2 weeks ago

On 12/19/2025 11:14 AM, K Prateek Nayak wrote:
> Hello Vern,
> 
> On 12/18/2025 3:12 PM, Vern Hao wrote:
>>
>> On 2025/12/18 16:32, Chen, Yu C wrote:
>>> On 12/18/2025 11:59 AM, Vern Hao wrote:
>>>>
>>>> On 2025/12/4 07:07, Tim Chen wrote:
>>>>> From: Chen Yu <yu.c.chen@intel.com>
>>>>>
>>>>> Prateek and Tingyin reported that memory-intensive workloads (such as
>>>>> stream) can saturate memory bandwidth and caches on the preferred LLC
>>>>> when sched_cache aggregates too many threads.
>>>>>
>>>>> To mitigate this, estimate a process's memory footprint by comparing
>>>>> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
>>>>> exceeds the LLC size, skip cache-aware scheduling.
>>>> Restricting RSS prevents many applications from benefiting from this optimization. I believe this restriction should be lifted. For memory- intensive workloads, the optimization may simply yield no gains, but it certainly shouldn't make performance worse. We need to further refine this logic.
>>>
>>> Memory-intensive workloads may trigger performance regressions when
>>> memory bandwidth(from L3 cache to memory controller) is saturated due
>> RSS size and bandwidth saturation are not necessarily linked, In my view, the optimization should be robust enough that it doesn't cause a noticeable drop in performance, no matter how large the RSS is.
> 
> Easier said than done. I agree RSS size is not a clear indication of
> bandwidth saturation. With NUMA Balancing enabled, we can use the
> hinting faults to estimate the working set and make decisions but for
> systems that do not have NUMA, short of programming some performance
> counters, there is no real way to estimate the working set.
> 
> Hinting faults are known to cause overheads so enabling them without
> NUMA can cause noticeable overheads with no real benefits.
> 
>> We need to have a more profound discussion on this.
> 
> What do you have in mind?
> 
>  From where I stand, having the RSS based bailout for now won't make
> things worse for these tasks with huge memory reserves and when we can
> all agree on some generic method to estimate the working set of a task,
> we can always add it into exceed_llc_capacity().
>

Prateek, thanks very much for the practical callouts - using RSS seems to be
the best trade-off we can go with for now. Vern, I get your point about the
concern between RSS and actual memory footprint. However, detecting the 
working
set doesn’t seem to be accurate or generic in kernel space - even with
NUMA fault statistics sampling. One reliable way I can think of to
  detect the working set is in user space, via resctrl (Intel RDT, AMD QoS,
Arm MPAM). So maybe we can leverage that information to implement 
fine-grained
control on a per-process or per-task basis later.

thanks,
Chenyu

Re: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for memory-heavy processes

Posted by Vern Hao 1 month, 2 weeks ago

On 2025/12/19 20:55, Chen, Yu C wrote:
> On 12/19/2025 11:14 AM, K Prateek Nayak wrote:
>> Hello Vern,
>>
>> On 12/18/2025 3:12 PM, Vern Hao wrote:
>>>
>>> On 2025/12/18 16:32, Chen, Yu C wrote:
>>>> On 12/18/2025 11:59 AM, Vern Hao wrote:
>>>>>
>>>>> On 2025/12/4 07:07, Tim Chen wrote:
>>>>>> From: Chen Yu <yu.c.chen@intel.com>
>>>>>>
>>>>>> Prateek and Tingyin reported that memory-intensive workloads 
>>>>>> (such as
>>>>>> stream) can saturate memory bandwidth and caches on the preferred 
>>>>>> LLC
>>>>>> when sched_cache aggregates too many threads.
>>>>>>
>>>>>> To mitigate this, estimate a process's memory footprint by comparing
>>>>>> its RSS (anonymous and shared pages) to the size of the LLC. If RSS
>>>>>> exceeds the LLC size, skip cache-aware scheduling.
>>>>> Restricting RSS prevents many applications from benefiting from 
>>>>> this optimization. I believe this restriction should be lifted. 
>>>>> For memory- intensive workloads, the optimization may simply yield 
>>>>> no gains, but it certainly shouldn't make performance worse. We 
>>>>> need to further refine this logic.
>>>>
>>>> Memory-intensive workloads may trigger performance regressions when
>>>> memory bandwidth(from L3 cache to memory controller) is saturated due
>>> RSS size and bandwidth saturation are not necessarily linked, In my 
>>> view, the optimization should be robust enough that it doesn't cause 
>>> a noticeable drop in performance, no matter how large the RSS is.
>>
>> Easier said than done. I agree RSS size is not a clear indication of
>> bandwidth saturation. With NUMA Balancing enabled, we can use the
>> hinting faults to estimate the working set and make decisions but for
>> systems that do not have NUMA, short of programming some performance
>> counters, there is no real way to estimate the working set.
>>
>> Hinting faults are known to cause overheads so enabling them without
>> NUMA can cause noticeable overheads with no real benefits.
>>
>>> We need to have a more profound discussion on this.
>>
>> What do you have in mind?
>>
>>  From where I stand, having the RSS based bailout for now won't make
>> things worse for these tasks with huge memory reserves and when we can
>> all agree on some generic method to estimate the working set of a task,
>> we can always add it into exceed_llc_capacity().
>>
>
> Prateek, thanks very much for the practical callouts - using RSS seems 
> to be
> the best trade-off we can go with for now. Vern, I get your point 
> about the
> concern between RSS and actual memory footprint. However, detecting 
> the working
> set doesn’t seem to be accurate or generic in kernel space - even with
> NUMA fault statistics sampling. One reliable way I can think of to
>  detect the working set is in user space, via resctrl (Intel RDT, AMD 
> QoS,
> Arm MPAM). So maybe we can leverage that information to implement 
> fine-grained
> control on a per-process or per-task basis later.
OK, I agree, thanks.
>
> thanks,
> Chenyu