[PATCH v2 17/23] sched/cache: Record the number of active threads per process for cache-aware scheduling

[PATCH v2 17/23] sched/cache: Record the number of active threads per process for cache-aware scheduling
Posted by Tim Chen 2 weeks, 1 day ago
From: Chen Yu <yu.c.chen@intel.com>

A performance regression was observed by Prateek when running hackbench
with many threads per process (high fd count). To avoid this, processes
with a large number of active threads are excluded from cache-aware
scheduling.

With sched_cache enabled, record the number of active threads in each
process during the periodic task_cache_work(). While iterating over
CPUs, if the currently running task belongs to the same process as the
task that launched task_cache_work(), increment the active thread count.

This number will be used by a subsequent patch to inhibit cache-aware
load balancing.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2: No change.

 include/linux/mm_types.h |  1 +
 kernel/sched/fair.c      | 11 +++++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 1ea16ef90566..04743983de4d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1043,6 +1043,7 @@ struct mm_struct {
 		raw_spinlock_t mm_sched_lock;
 		unsigned long mm_sched_epoch;
 		int mm_sched_cpu;
+		u64 nr_running_avg ____cacheline_aligned_in_smp;
 #endif
 
 #ifdef CONFIG_MMU
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 580a967efdac..2f38ad82688f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1421,11 +1421,11 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
 
 static void __no_profile task_cache_work(struct callback_head *work)
 {
-	struct task_struct *p = current;
+	struct task_struct *p = current, *cur;
 	struct mm_struct *mm = p->mm;
 	unsigned long m_a_occ = 0;
 	unsigned long curr_m_a_occ = 0;
-	int cpu, m_a_cpu = -1;
+	int cpu, m_a_cpu = -1, nr_running = 0;
 	cpumask_var_t cpus;
 
 	WARN_ON_ONCE(work != &p->cache_work);
@@ -1458,6 +1458,12 @@ static void __no_profile task_cache_work(struct callback_head *work)
 					m_occ = occ;
 					m_cpu = i;
 				}
+				rcu_read_lock();
+				cur = rcu_dereference(cpu_rq(i)->curr);
+				if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
+				    cur->mm == mm)
+					nr_running++;
+				rcu_read_unlock();
 			}
 
 			/*
@@ -1501,6 +1507,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
 		mm->mm_sched_cpu = m_a_cpu;
 	}
 
+	update_avg(&mm->nr_running_avg, nr_running);
 	free_cpumask_var(cpus);
 }
 
-- 
2.32.0
Re: [PATCH v2 17/23] sched/cache: Record the number of active threads per process for cache-aware scheduling
Posted by Aaron Lu 2 days, 9 hours ago
On Wed, Dec 03, 2025 at 03:07:36PM -0800, Tim Chen wrote:
> @@ -1501,6 +1507,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
>  		mm->mm_sched_cpu = m_a_cpu;
>  	}
>  
> +	update_avg(&mm->nr_running_avg, nr_running);

update_avg() doesn't appear to deal with small numbers well and can have
an error as large as 7, e.g. when nr_running < 8, nr_running_avg will
always be 0 and when nr_running >= 8 && < 16, nr_running_avg will be
1 - 8, etc.

AMD Genoa has 8 cores per LLC and this will break exceed_llc_nr() there.
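
For reference, update_avg() in kernel/sched/sched.h is (roughly) a
running average with a fixed divisor of 8, which is why samples smaller
than 8 can never move a zero average:

static inline void update_avg(u64 *avg, u64 sample)
{
	s64 diff = sample - *avg;
	*avg += diff / 8;
}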

>  	free_cpumask_var(cpus);
>  }
Re: [PATCH v2 17/23] sched/cache: Record the number of active threads per process for cache-aware scheduling
Posted by Chen, Yu C 2 days, 6 hours ago
On 12/17/2025 5:40 PM, Aaron Lu wrote:
> On Wed, Dec 03, 2025 at 03:07:36PM -0800, Tim Chen wrote:
>> @@ -1501,6 +1507,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
>>   		mm->mm_sched_cpu = m_a_cpu;
>>   	}
>>   
>> +	update_avg(&mm->nr_running_avg, nr_running);
> 
> update_avg() doesn't appear to deal with small numbers well and can have
> an error as large as 7, e.g. when nr_running < 8, nr_running_avg will
> always be 0 and when nr_running >= 8 && < 16, nr_running_avg will be
> 1 - 8, etc.
> 
> AMD Genoa has 8 cores per LLC and this will break exceed_llc_nr() there.
> 

Ah, you are right, thanks for pointing this out. Dividing by 8 would make
convergence slow on systems with small LLCs. Maybe we should consider the
number of cores in the LLC: the smaller that number, the more weight we
should give to the diff between two invocations of update_avg()?

static inline void sched_cache_update_avg(u64 *avg, u64 sample)
{
	s64 diff = sample - *avg;
	u32 divisor = clamp_t(u32, nr_cores_llc/4, 2, 8);

	*avg += diff / divisor;
}

For <=8 cores per LLC, the divisor is 2,
for 16 cores per LLC, the divisor is 4,
for >=32 cores per LLC, the divisor is 8
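
As a rough sanity check: on an 8-core LLC (divisor 2) a steady sample of
7 would move the average 0 -> 3 -> 5 -> 6 and then hold within 1 of the
sample, instead of being stuck at 0 as with the fixed divisor of 8.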

Thanks,
Chenyu

>>   	free_cpumask_var(cpus);
>>   }
Re: [PATCH v2 17/23] sched/cache: Record the number of active threads per process for cache-aware scheduling
Posted by Aaron Lu 16 hours ago
On Wed, Dec 17, 2025 at 08:51:50PM +0800, Chen, Yu C wrote:
> On 12/17/2025 5:40 PM, Aaron Lu wrote:
> > On Wed, Dec 03, 2025 at 03:07:36PM -0800, Tim Chen wrote:
> > > @@ -1501,6 +1507,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
> > >   		mm->mm_sched_cpu = m_a_cpu;
> > >   	}
> > > +	update_avg(&mm->nr_running_avg, nr_running);
> > 
> > update_avg() doesn't appear to deal with small numbers well and can have
> > an error as large as 7, e.g. when nr_running < 8, nr_running_avg will
> > always be 0 and when nr_running >= 8 && < 16, nr_running_avg will be
> > 1 - 8, etc.
> > 
> > AMD Genoa has 8 cores per LLC and this will break exceed_llc_nr() there.
> > 
> 
> Ah, you are right, thanks for pointing this out. Dividing by 8 would make
> convergence slow on systems with small LLCs. Maybe we should consider the

Not just slow but the error is too large for a small LLC.

> number of cores in the LLC: the smaller that number, the more weight we
> should give to the diff between two invocations of update_avg()?
> 
> static inline void sched_cache_update_avg(u64 *avg, u64 sample)
> {
> 	s64 diff = sample - *avg;
> 	u32 divisor = clamp_t(u32, nr_cores_llc/4, 2, 8);
> 
> 	*avg += diff / divisor;
> }
> 
> For <=8 cores per LLC, the divisor is 2,
> for 16 cores per LLC, the divisor is 4,
> for >=32 cores per LLC, the divisor is 8

Yeah I guess it works. The error can be as large as 'divisor - 1' but
since this avg is an estimate, it may be OK.
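
(At steady state the average stops moving once its distance from the
sample drops below the divisor, hence the divisor - 1 bound.)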
Re: [PATCH v2 17/23] sched/cache: Record the number of active threads per process for cache-aware scheduling
Posted by Peter Zijlstra 1 week, 2 days ago
On Wed, Dec 03, 2025 at 03:07:36PM -0800, Tim Chen wrote:
> From: Chen Yu <yu.c.chen@intel.com>
> 
> A performance regression was observed by Prateek when running hackbench
> with many threads per process (high fd count). To avoid this, processes
> with a large number of active threads are excluded from cache-aware
> scheduling.
> 
> With sched_cache enabled, record the number of active threads in each
> process during the periodic task_cache_work(). While iterating over
> CPUs, if the currently running task belongs to the same process as the
> task that launched task_cache_work(), increment the active thread count.
> 
> This number will be used by a subsequent patch to inhibit cache-aware
> load balancing.
> 
> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> 
> Notes:
>     v1->v2: No change.
> 
>  include/linux/mm_types.h |  1 +
>  kernel/sched/fair.c      | 11 +++++++++--
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 1ea16ef90566..04743983de4d 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1043,6 +1043,7 @@ struct mm_struct {
>  		raw_spinlock_t mm_sched_lock;
>  		unsigned long mm_sched_epoch;
>  		int mm_sched_cpu;
> +		u64 nr_running_avg ____cacheline_aligned_in_smp;

This is unlikely to do what you hope it does: it will place this
variable on a new cacheline, but will not ensure this variable is the
only one in that line. Notably pgtables_bytes (the next field in this
structure) will share the line.

It might all be less dodgy if you stick these here fields in their own
structure, a little like mm_mm_cid or so.
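
Something along these lines, perhaps (a sketch; field names are only
illustrative):

/* group the sched_cache fields so they own their cacheline(s) */
struct mm_sched {
	raw_spinlock_t	lock;
	unsigned long	epoch;
	int		cpu;
	u64		nr_running_avg;
} ____cacheline_aligned_in_smp;

That way nr_running_avg stops pushing an unrelated field onto a shared
line.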

>  #endif
>  
>  #ifdef CONFIG_MMU
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 580a967efdac..2f38ad82688f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1421,11 +1421,11 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
>  
>  static void __no_profile task_cache_work(struct callback_head *work)
>  {
> -	struct task_struct *p = current;
> +	struct task_struct *p = current, *cur;
>  	struct mm_struct *mm = p->mm;
>  	unsigned long m_a_occ = 0;
>  	unsigned long curr_m_a_occ = 0;
> -	int cpu, m_a_cpu = -1;
> +	int cpu, m_a_cpu = -1, nr_running = 0;
>  	cpumask_var_t cpus;
>  
>  	WARN_ON_ONCE(work != &p->cache_work);
> @@ -1458,6 +1458,12 @@ static void __no_profile task_cache_work(struct callback_head *work)
>  					m_occ = occ;
>  					m_cpu = i;
>  				}

	guard(rcu)();

> +				rcu_read_lock();
> +				cur = rcu_dereference(cpu_rq(i)->curr);
> +				if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
> +				    cur->mm == mm)
> +					nr_running++;
> +				rcu_read_unlock();
>  			}
>  
>  			/*
> @@ -1501,6 +1507,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
>  		mm->mm_sched_cpu = m_a_cpu;
>  	}
>  
> +	update_avg(&mm->nr_running_avg, nr_running);
>  	free_cpumask_var(cpus);
>  }

It's a wee bit weird to introduce nr_running_avg without its user. Makes
it hard to see what's what.
Re: [PATCH v2 17/23] sched/cache: Record the number of active threads per process for cache-aware scheduling
Posted by Chen, Yu C 3 days, 11 hours ago
On 12/11/2025 12:51 AM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:36PM -0800, Tim Chen wrote:
>> From: Chen Yu <yu.c.chen@intel.com>
>>
>> A performance regression was observed by Prateek when running hackbench
>> with many threads per process (high fd count). To avoid this, processes
>> with a large number of active threads are excluded from cache-aware
>> scheduling.
>>
>> With sched_cache enabled, record the number of active threads in each
>> process during the periodic task_cache_work(). While iterating over
>> CPUs, if the currently running task belongs to the same process as the
>> task that launched task_cache_work(), increment the active thread count.
>>
>> This number will be used by a subsequent patch to inhibit cache-aware
>> load balancing.
>>
>> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> ---
>>
>> Notes:
>>      v1->v2: No change.
>>
>>   include/linux/mm_types.h |  1 +
>>   kernel/sched/fair.c      | 11 +++++++++--
>>   2 files changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 1ea16ef90566..04743983de4d 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -1043,6 +1043,7 @@ struct mm_struct {
>>   		raw_spinlock_t mm_sched_lock;
>>   		unsigned long mm_sched_epoch;
>>   		int mm_sched_cpu;
>> +		u64 nr_running_avg ____cacheline_aligned_in_smp;
> 
> This is unlikely to do what you hope it does: it will place this
> variable on a new cacheline, but will not ensure this variable is the
> only one in that line. Notably pgtables_bytes (the next field in this
> structure) will share the line.
> 
> It might all be less dodgy if you stick these here fields in their own
> structure, a little like mm_mm_cid or so.
> 

Got it, will do.

>>   #endif
>>   
>>   #ifdef CONFIG_MMU
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 580a967efdac..2f38ad82688f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1421,11 +1421,11 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
>>   
>>   static void __no_profile task_cache_work(struct callback_head *work)
>>   {
>> -	struct task_struct *p = current;
>> +	struct task_struct *p = current, *cur;
>>   	struct mm_struct *mm = p->mm;
>>   	unsigned long m_a_occ = 0;
>>   	unsigned long curr_m_a_occ = 0;
>> -	int cpu, m_a_cpu = -1;
>> +	int cpu, m_a_cpu = -1, nr_running = 0;
>>   	cpumask_var_t cpus;
>>   
>>   	WARN_ON_ONCE(work != &p->cache_work);
>> @@ -1458,6 +1458,12 @@ static void __no_profile task_cache_work(struct callback_head *work)
>>   					m_occ = occ;
>>   					m_cpu = i;
>>   				}
> 
> 	guard(rcu)();
> 

OK.
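
(Roughly, the counting hunk inside the per-CPU loop would then lose the
explicit lock/unlock pair:

				guard(rcu)();
				cur = rcu_dereference(cpu_rq(i)->curr);
				if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
				    cur->mm == mm)
					nr_running++;

with the RCU read-side section held until the end of the enclosing
scope.)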

>> +				rcu_read_lock();
>> +				cur = rcu_dereference(cpu_rq(i)->curr);
>> +				if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
>> +				    cur->mm == mm)
>> +					nr_running++;
>> +				rcu_read_unlock();
>>   			}
>>   
>>   			/*
>> @@ -1501,6 +1507,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
>>   		mm->mm_sched_cpu = m_a_cpu;
>>   	}
>>   
>> +	update_avg(&mm->nr_running_avg, nr_running);
>>   	free_cpumask_var(cpus);
>>   }
> 
> It's a wee bit weird to introduce nr_running_avg without its user. Makes
> it hard to see what's what.

OK, will introduce it together with its user.

thanks,
Chenyu