[PATCH v2 15/23] sched/cache: Respect LLC preference in task migration and detach

Posted by Tim Chen 2 weeks, 1 day ago
During the final step of load balancing, can_migrate_task() now
considers a task's LLC preference before moving it out of its
preferred LLC.

Additionally, add checks in detach_tasks() to prevent selecting tasks
that prefer their current LLC.

Co-developed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---

Notes:
    v1->v2: Leave out tasks under core scheduling from the cache aware
            load balance. (K Prateek Nayak)
    
            Reduce the degree of honoring preferred_llc in detach_tasks().
            If certain conditions are met, stop migrating tasks that prefer
            their current LLC and instead continue load balancing from other
            busiest runqueues. (K Prateek Nayak)

 kernel/sched/fair.c  | 63 ++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h | 13 +++++++++
 2 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dd09a816670e..580a967efdac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9852,8 +9852,8 @@ static enum llc_mig can_migrate_llc(int src_cpu, int dst_cpu,
  * Check if task p can migrate from source LLC to
  * destination LLC in terms of cache aware load balance.
  */
-static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
-							struct task_struct *p)
+static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
+					 struct task_struct *p)
 {
 	struct mm_struct *mm;
 	bool to_pref;
@@ -10025,6 +10025,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (env->flags & LBF_ACTIVE_LB)
 		return 1;
 
+#ifdef CONFIG_SCHED_CACHE
+	if (sched_cache_enabled() &&
+	    can_migrate_llc_task(env->src_cpu, env->dst_cpu, p) == mig_forbid &&
+	    !task_has_sched_core(p))
+		return 0;
+#endif
+
 	degrades = migrate_degrades_locality(p, env);
 	if (!degrades)
 		hot = task_hot(p, env);
@@ -10146,12 +10153,55 @@ static struct list_head
 	list_splice(&pref_old_llc, tasks);
 	return tasks;
 }
+
+static bool stop_migrate_src_rq(struct task_struct *p,
+				struct lb_env *env,
+				int detached)
+{
+	if (!sched_cache_enabled() || p->preferred_llc == -1 ||
+	    cpus_share_cache(env->src_cpu, env->dst_cpu) ||
+	    env->sd->nr_balance_failed)
+		return false;
+
+	/*
+	 * Stop migration for the src_rq and pull from a
+	 * different busy runqueue in the following cases:
+	 *
+	 * 1. Trying to migrate task to its preferred
+	 *    LLC, but the chosen task does not prefer dest
+	 *    LLC - case 3 in order_tasks_by_llc(). This violates
+	 *    the goal of migrate_llc_task. However, we should
+	 *    stop detaching only if some tasks have been detached
+	 *    and the imbalance has been mitigated.
+	 *
+	 * 2. Don't detach more tasks if the remaining tasks want
+	 *    to stay. We know the remaining tasks all prefer the
+	 *    current LLC, because after order_tasks_by_llc(), the
+	 *    tasks that prefer the current LLC are the least favored
+	 *    candidates to be migrated out.
+	 */
+	if (env->migration_type == migrate_llc_task &&
+	    detached && llc_id(env->dst_cpu) != p->preferred_llc)
+		return true;
+
+	if (llc_id(env->src_cpu) == p->preferred_llc)
+		return true;
+
+	return false;
+}
 #else
 static inline struct list_head
 *order_tasks_by_llc(struct lb_env *env, struct list_head *tasks)
 {
 	return tasks;
 }
+
+static bool stop_migrate_src_rq(struct task_struct *p,
+				struct lb_env *env,
+				int detached)
+{
+	return false;
+}
 #endif
 
 /*
@@ -10205,6 +10255,15 @@ static int detach_tasks(struct lb_env *env)
 
 		p = list_last_entry(tasks, struct task_struct, se.group_node);
 
+		/*
+		 * Check if detaching current src_rq should be stopped, because
+		 * doing so would break cache aware load balance. If we stop
+		 * here, the env->flags has LBF_ALL_PINNED, which would cause
+		 * the load balance to pull from another busy runqueue.
+		 */
+		if (stop_migrate_src_rq(p, env, detached))
+			break;
+
 		if (!can_migrate_task(p, env))
 			goto next;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8f2a779825e4..40798a06e058 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1485,6 +1485,14 @@ extern void sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags);
 extern void sched_core_get(void);
 extern void sched_core_put(void);
 
+static inline bool task_has_sched_core(struct task_struct *p)
+{
+	if (sched_core_disabled())
+		return false;
+
+	return !!p->core_cookie;
+}
+
 #else /* !CONFIG_SCHED_CORE: */
 
 static inline bool sched_core_enabled(struct rq *rq)
@@ -1524,6 +1532,11 @@ static inline bool sched_group_cookie_match(struct rq *rq,
 	return true;
 }
 
+static inline bool task_has_sched_core(struct task_struct *p)
+{
+	return false;
+}
+
 #endif /* !CONFIG_SCHED_CORE */
 
 #ifdef CONFIG_RT_GROUP_SCHED
-- 
2.32.0
Re: [PATCH v2 15/23] sched/cache: Respect LLC preference in task migration and detach
Posted by Peter Zijlstra 1 week, 2 days ago
On Wed, Dec 03, 2025 at 03:07:34PM -0800, Tim Chen wrote:

> @@ -10025,6 +10025,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  	if (env->flags & LBF_ACTIVE_LB)
>  		return 1;
>  
> +#ifdef CONFIG_SCHED_CACHE
> +	if (sched_cache_enabled() &&
> +	    can_migrate_llc_task(env->src_cpu, env->dst_cpu, p) == mig_forbid &&
> +	    !task_has_sched_core(p))
> +		return 0;
> +#endif

This seems wrong:
 - it does not let nr_balance_failed override things;
 - it takes precedence over migrate_degrades_locality(); you really want
   to migrate towards the preferred NUMA node over staying on your LLC.

That is, this really wants to be done after migrate_degrades_locality()
and only if degrades == 0 or something.
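Peter's suggested ordering can be modeled in userspace C. This is a hedged sketch, not the kernel code: `can_migrate_model()`, its parameters, and the `mig_allow` enumerator are hypothetical stand-ins (only `mig_forbid` appears in the patch), and the existing hot-task handling for `degrades > 0` is elided.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the ordering Peter suggests: the LLC veto only
 * applies when NUMA locality is neutral (degrades == 0), and repeated
 * balance failures still override it. Names are illustrative, not the
 * real kernel API. */
enum llc_mig { mig_allow, mig_forbid };

static bool can_migrate_model(long degrades, enum llc_mig llc_verdict,
			      unsigned int failed)
{
	if (degrades < 0)
		return true;	/* moving toward preferred NUMA node wins */
	if (degrades > 0)
		return false;	/* real kernel consults task_hot() etc.; elided */
	/* degrades == 0: NUMA-neutral, so the LLC preference may veto,
	 * unless load balance has already failed repeatedly. */
	if (llc_verdict == mig_forbid && !failed)
		return false;
	return true;
}
```

The key property is that a NUMA-improving move (`degrades < 0`) is never blocked by the LLC preference, and `nr_balance_failed` can still break the LLC veto.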

>  	degrades = migrate_degrades_locality(p, env);
>  	if (!degrades)
>  		hot = task_hot(p, env);
> @@ -10146,12 +10153,55 @@ static struct list_head
>  	list_splice(&pref_old_llc, tasks);
>  	return tasks;
>  }
> +
> +static bool stop_migrate_src_rq(struct task_struct *p,
> +				struct lb_env *env,
> +				int detached)
> +{
> +	if (!sched_cache_enabled() || p->preferred_llc == -1 ||
> +	    cpus_share_cache(env->src_cpu, env->dst_cpu) ||
> +	    env->sd->nr_balance_failed)
> +		return false;

But you are allowing nr_balance_failed to override things here.

> +	/*
> +	 * Stop migration for the src_rq and pull from a
> +	 * different busy runqueue in the following cases:
> +	 *
> +	 * 1. Trying to migrate task to its preferred
> +	 *    LLC, but the chosen task does not prefer dest
> +	 *    LLC - case 3 in order_tasks_by_llc(). This violates
> +	 *    the goal of migrate_llc_task. However, we should
> +	 *    stop detaching only if some tasks have been detached
> +	 *    and the imbalance has been mitigated.
> +	 *
> +	 * 2. Don't detach more tasks if the remaining tasks want
> +	 *    to stay. We know the remaining tasks all prefer the
> +	 *    current LLC, because after order_tasks_by_llc(), the
> +	 *    tasks that prefer the current LLC are the least favored
> +	 *    candidates to be migrated out.
> +	 */
> +	if (env->migration_type == migrate_llc_task &&
> +	    detached && llc_id(env->dst_cpu) != p->preferred_llc)
> +		return true;
> +
> +	if (llc_id(env->src_cpu) == p->preferred_llc)
> +		return true;
> +
> +	return false;
> +}

Also, I think we have a problem with nr_balance_failed, cache_nice_tries
is 1 for SHARE_LLC; this means for failed=0 we ignore:

 - ineligible tasks
 - llc fail
 - node-degrading / hot

and then the very next round, we do all of them at once, without much
grading.

> @@ -10205,6 +10255,15 @@ static int detach_tasks(struct lb_env *env)
>  
>  		p = list_last_entry(tasks, struct task_struct, se.group_node);
>  
> +		/*
> +		 * Check if detaching current src_rq should be stopped, because
> +		 * doing so would break cache aware load balance. If we stop
> +		 * here, the env->flags has LBF_ALL_PINNED, which would cause
> +		 * the load balance to pull from another busy runqueue.

Uhh, can_migrate_task() will clear that ALL_PINNED thing if we've found
at least one task before getting here.

> +		 */
> +		if (stop_migrate_src_rq(p, env, detached))
> +			break;


Perhaps split cfs_tasks into multiple lists from the get-go? That avoids
this sorting.
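A minimal userspace sketch of the multiple-lists idea: enqueue each task onto one of two lists up front, keyed on whether it prefers the LLC it is running in, so detach_tasks() could drain the "prefers another LLC" list first without any sorting. All names here (`struct rq_lists`, `pref_here`, `pref_away`) are hypothetical; the kernel would use `list_head` on `cfs_tasks` instead of this toy singly linked list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: split cfs_tasks into two lists at enqueue time
 * instead of sorting in order_tasks_by_llc(). */
struct task {
	int preferred_llc;
	struct task *next;
};

struct rq_lists {
	struct task *pref_here;	/* tasks preferring this rq's LLC */
	struct task *pref_away;	/* tasks preferring some other LLC */
};

static void enqueue(struct rq_lists *rq, struct task *t, int this_llc)
{
	struct task **head = (t->preferred_llc == this_llc) ?
			     &rq->pref_here : &rq->pref_away;
	t->next = *head;
	*head = t;
}

/* Pick the next detach candidate: drain pref_away before pref_here,
 * so tasks that want to stay are the last to be considered. */
static struct task *next_candidate(struct rq_lists *rq)
{
	if (rq->pref_away) {
		struct task *t = rq->pref_away;
		rq->pref_away = t->next;
		return t;
	}
	if (rq->pref_here) {
		struct task *t = rq->pref_here;
		rq->pref_here = t->next;
		return t;
	}
	return NULL;
}
```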
Re: [PATCH v2 15/23] sched/cache: Respect LLC preference in task migration and detach
Posted by Chen, Yu C 3 days, 11 hours ago
On 12/11/2025 12:30 AM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:34PM -0800, Tim Chen wrote:
> 
>> @@ -10025,6 +10025,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>>   	if (env->flags & LBF_ACTIVE_LB)
>>   		return 1;
>>   
>> +#ifdef CONFIG_SCHED_CACHE
>> +	if (sched_cache_enabled() &&
>> +	    can_migrate_llc_task(env->src_cpu, env->dst_cpu, p) == mig_forbid &&
>> +	    !task_has_sched_core(p))
>> +		return 0;
>> +#endif
> 
> This seems wrong:
>   - it does not let nr_balance_failed override things;
>   - it takes precedence over migrate_degrades_locality(); you really want
>     to migrate towards the preferred NUMA node over staying on your LLC.
> 
> That is, this really wants to be done after migrate_degrades_locality()
> and only if degrades == 0 or something.
> 

OK, will fix it.

>>   	degrades = migrate_degrades_locality(p, env);
>>   	if (!degrades)
>>   		hot = task_hot(p, env);
>> @@ -10146,12 +10153,55 @@ static struct list_head
>>   	list_splice(&pref_old_llc, tasks);
>>   	return tasks;
>>   }
>> +
>> +static bool stop_migrate_src_rq(struct task_struct *p,
>> +				struct lb_env *env,
>> +				int detached)
>> +{
>> +	if (!sched_cache_enabled() || p->preferred_llc == -1 ||
>> +	    cpus_share_cache(env->src_cpu, env->dst_cpu) ||
>> +	    env->sd->nr_balance_failed)
>> +		return false;
> 
> But you are allowing nr_balance_failed to override things here.
> 
>> +	/*
>> +	 * Stop migration for the src_rq and pull from a
>> +	 * different busy runqueue in the following cases:
>> +	 *
>> +	 * 1. Trying to migrate task to its preferred
>> +	 *    LLC, but the chosen task does not prefer dest
>> +	 *    LLC - case 3 in order_tasks_by_llc(). This violates
>> +	 *    the goal of migrate_llc_task. However, we should
>> +	 *    stop detaching only if some tasks have been detached
>> +	 *    and the imbalance has been mitigated.
>> +	 *
>> +	 * 2. Don't detach more tasks if the remaining tasks want
>> +	 *    to stay. We know the remaining tasks all prefer the
>> +	 *    current LLC, because after order_tasks_by_llc(), the
>> +	 *    tasks that prefer the current LLC are the least favored
>> +	 *    candidates to be migrated out.
>> +	 */
>> +	if (env->migration_type == migrate_llc_task &&
>> +	    detached && llc_id(env->dst_cpu) != p->preferred_llc)
>> +		return true;
>> +
>> +	if (llc_id(env->src_cpu) == p->preferred_llc)
>> +		return true;
>> +
>> +	return false;
>> +}
> 
> Also, I think we have a problem with nr_balance_failed, cache_nice_tries
> is 1 for SHARE_LLC; this means for failed=0 we ignore:
> 
>   - ineligible tasks
>   - llc fail
>   - node-degrading / hot
> 
> and then the very next round, we do all of them at once, without much
> grading.
> 

Do you mean we could set different thresholds for the different
scenarios you mentioned above, so that detach_tasks() does not
release all of those task classes for migration at once?

For example,

ineligible tasks check:
if (env->sd->nr_balance_failed > env->sd->cache_nice_tries)
     can_migrate;

llc fail check:
if (env->sd->nr_balance_failed > env->sd->cache_nice_tries + 1)
     can_migrate;

node-degrading/hot check:
if (env->sd->nr_balance_failed > env->sd->cache_nice_tries + 2)
     can_migrate;
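The escalating thresholds sketched above can be written as one small decision function. This is only a model of the proposal (`struct overrides` and `graded()` are hypothetical names); in the kernel the three comparisons would sit at the respective check sites in can_migrate_task().

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the graded override: each class of "sticky" task needs a
 * progressively higher nr_balance_failed before it may be migrated,
 * so one failed round does not release all of them at once.
 * (cache_nice_tries is 1 for LLC-sharing domains.) */
struct overrides {
	bool ineligible;	/* ineligible tasks may move */
	bool llc;		/* LLC preference may be broken */
	bool hot;		/* node-degrading / cache-hot tasks may move */
};

static struct overrides graded(unsigned int failed, unsigned int nice_tries)
{
	struct overrides o;

	o.ineligible = failed > nice_tries;
	o.llc        = failed > nice_tries + 1;
	o.hot        = failed > nice_tries + 2;
	return o;
}
```

With `nice_tries == 1`, two failed rounds release only ineligible tasks, three additionally break LLC preference, and four finally allow hot/node-degrading tasks, restoring the grading Peter asked for.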


>> @@ -10205,6 +10255,15 @@ static int detach_tasks(struct lb_env *env)
>>   
>>   		p = list_last_entry(tasks, struct task_struct, se.group_node);
>>   
>> +		/*
>> +		 * Check if detaching current src_rq should be stopped, because
>> +		 * doing so would break cache aware load balance. If we stop
>> +		 * here, the env->flags has LBF_ALL_PINNED, which would cause
>> +		 * the load balance to pull from another busy runqueue.
> 
> Uhh, can_migrate_task() will clear that ALL_PINNED thing if we've found
> at least one task before getting here.
> 

One problem is that LBF_ALL_PINNED is cleared before
migrate_degrades_locality()/can_migrate_llc_task() run in
detach_tasks(). I suppose we want to keep LBF_ALL_PINNED set if
can_migrate_llc_task() fails (that is, when the migration would
break LLC locality).
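A userspace model of that fix: clear the flag only once a task passes both the affinity check and the LLC check. This is a hedged sketch with made-up names (`struct cand`, `scan_tasks()`); the kernel equivalent would be moving the `env->flags &= ~LBF_ALL_PINNED` in can_migrate_task() below the LLC-preference test.

```c
#include <assert.h>
#include <stdbool.h>

#define LBF_ALL_PINNED 0x01

/* Hypothetical model: a src_rq whose tasks are all held back by their
 * LLC preference still reads as "all pinned", steering the balancer
 * to a different busiest runqueue. */
struct cand {
	bool cpu_allowed;	/* dst cpu is in the task's cpus_ptr */
	bool llc_ok;		/* can_migrate_llc_task() != mig_forbid */
};

static unsigned int scan_tasks(const struct cand *tasks, int n)
{
	unsigned int flags = LBF_ALL_PINNED;
	int i;

	for (i = 0; i < n; i++) {
		if (!tasks[i].cpu_allowed)
			continue;	/* truly pinned: keep ALL_PINNED */
		if (!tasks[i].llc_ok)
			continue;	/* LLC veto: also keep ALL_PINNED */
		flags &= ~LBF_ALL_PINNED;	/* a movable task exists */
	}
	return flags;
}
```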

>> +		 */
>> +		if (stop_migrate_src_rq(p, env, detached))
>> +			break;
> 
> 
> Perhaps split cfs_tasks into multiple lists from the get-go? That avoids
> this sorting.

Will check with Tim on this.

thanks,
Chenyu