From: zhangsong
Subject: [PATCH v3] sched/fair: Introduce priority load balance to reduce interference from IDLE tasks
Date: Wed, 10 Aug 2022 17:25:46 +0800
Message-ID: <20220810092546.3901325-1-zhangsong34@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

When NORMAL and IDLE tasks are co-located, it is reasonable for CFS load balance to prefer migrating NORMAL (latency-sensitive) tasks from the busy src CPU to the dst CPU, and to migrate IDLE tasks last.

Consider a situation where CPU A has several NORMAL tasks and hundreds of IDLE tasks while CPU B is idle, and CPU B needs to pull some tasks from CPU A. The cfs_tasks list on CPU A is not ordered by priority, and the maximum number of tasks pulled in one pass is bounded by env->loop_max, whose value is sysctl_sched_nr_migrate, i.e. 32. So we cannot guarantee that CPU B pulls NORMAL tasks rather than IDLE tasks from CPU A's waiting queue. It is therefore necessary to split cfs_tasks into two lists and ensure that tasks on the non-idle list are migrated first. This is very important for reducing interference from IDLE tasks.

The CFS load balance path is optimized as follows:
1. The `cfs_tasks` list of a CPU rq holds only NORMAL tasks.
2. A new `cfs_idle_tasks` list of a CPU rq holds the IDLE tasks.
3. Prefer migrating NORMAL tasks from cfs_tasks to the dst CPU.
4. Migrate IDLE tasks from cfs_idle_tasks to the dst CPU last.

This was tested with the following reproduction:
- a small number of NORMAL tasks co-located with a large number of IDLE tasks

With this patch, NORMAL task latency is reduced by about 5~10% compared with the current code.
Signed-off-by: zhangsong
---
V2->V3:
- rename variable loop(int) to has_detach_cfs_idle(bool) and make it more readable
- add more description for priority load balance
---
 kernel/sched/core.c  |  1 +
 kernel/sched/fair.c  | 47 ++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h |  1 +
 3 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee28253c9ac0..7325c6e552d8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9733,6 +9733,7 @@ void __init sched_init(void)
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;

 		INIT_LIST_HEAD(&rq->cfs_tasks);
+		INIT_LIST_HEAD(&rq->cfs_idle_tasks);

 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ_COMMON
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 914096c5b1ae..189c2b3131ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3034,6 +3034,21 @@ static inline void update_scan_period(struct task_struct *p, int new_cpu)

 #endif /* CONFIG_NUMA_BALANCING */

+#ifdef CONFIG_SMP
+static void
+adjust_rq_cfs_tasks(void (*list_op)(struct list_head *, struct list_head *),
+		    struct rq *rq,
+		    struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+	if (task_has_idle_policy(task_of(se)) || tg_is_idle(cfs_rq->tg))
+		(*list_op)(&se->group_node, &rq->cfs_idle_tasks);
+	else
+		(*list_op)(&se->group_node, &rq->cfs_tasks);
+}
+#endif
+
 static void
 account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -3043,7 +3058,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		struct rq *rq = rq_of(cfs_rq);

 		account_numa_enqueue(rq, task_of(se));
-		list_add(&se->group_node, &rq->cfs_tasks);
+		adjust_rq_cfs_tasks(list_add, rq, se);
 	}
 #endif
 	cfs_rq->nr_running++;
@@ -7465,7 +7480,7 @@ done: __maybe_unused;
	 * the list, so our cfs_tasks list becomes MRU
	 * one.
	 */
-	list_move(&p->se.group_node, &rq->cfs_tasks);
+	adjust_rq_cfs_tasks(list_move, rq, &p->se);
 #endif

 	if (hrtick_enabled_fair(rq))
@@ -7788,6 +7803,9 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 	if (unlikely(task_has_idle_policy(p)))
 		return 0;

+	if (tg_is_idle(cfs_rq_of(&p->se)->tg))
+		return 0;
+
 	/* SMT siblings share cache */
 	if (env->sd->flags & SD_SHARE_CPUCAPACITY)
 		return 0;
@@ -7800,6 +7818,11 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 			 &p->se == cfs_rq_of(&p->se)->last))
 		return 1;

+	/* When preempting a sched idle cpu, do not consider migration cost */
+	if (cpus_share_cache(env->src_cpu, env->dst_cpu) &&
+	    sched_idle_cpu(env->dst_cpu))
+		return 0;
+
 	if (sysctl_sched_migration_cost == -1)
 		return 1;
@@ -7990,11 +8013,14 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
 static struct task_struct *detach_one_task(struct lb_env *env)
 {
 	struct task_struct *p;
+	struct list_head *tasks = &env->src_rq->cfs_tasks;
+	bool has_detach_cfs_idle = false;

 	lockdep_assert_rq_held(env->src_rq);

+again:
 	list_for_each_entry_reverse(p,
-			&env->src_rq->cfs_tasks, se.group_node) {
+			tasks, se.group_node) {
 		if (!can_migrate_task(p, env))
 			continue;

@@ -8009,6 +8035,11 @@ static struct task_struct *detach_one_task(struct lb_env *env)
 		schedstat_inc(env->sd->lb_gained[env->idle]);
 		return p;
 	}
+	if (!has_detach_cfs_idle) {
+		has_detach_cfs_idle = true;
+		tasks = &env->src_rq->cfs_idle_tasks;
+		goto again;
+	}
 	return NULL;
 }

@@ -8026,6 +8057,7 @@ static int detach_tasks(struct lb_env *env)
 	unsigned long util, load;
 	struct task_struct *p;
 	int detached = 0;
+	bool has_detach_cfs_idle = false;

 	lockdep_assert_rq_held(env->src_rq);

@@ -8041,6 +8073,7 @@ static int detach_tasks(struct lb_env *env)
 	if (env->imbalance <= 0)
 		return 0;

+again:
 	while (!list_empty(tasks)) {
 		/*
 		 * We don't want to steal all, otherwise we may be treated likewise,
@@ -8142,6 +8175,12 @@ static int detach_tasks(struct lb_env *env)
 		list_move(&p->se.group_node, tasks);
 	}

+	if (env->imbalance > 0 && !has_detach_cfs_idle) {
+		has_detach_cfs_idle = true;
+		tasks = &env->src_rq->cfs_idle_tasks;
+		goto again;
+	}
+
 	/*
 	 * Right now, this is one of only two places we collect this stat
 	 * so we can safely collect detach_one_task() stats here rather
@@ -11643,7 +11682,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 		 * Move the next running task to the front of the list, so our
 		 * cfs_tasks list becomes MRU one.
 		 */
-		list_move(&se->group_node, &rq->cfs_tasks);
+		adjust_rq_cfs_tasks(list_move, rq, se);
 	}
 #endif

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e26688d387ae..accb4eea9769 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1068,6 +1068,7 @@ struct rq {
 	int online;

 	struct list_head cfs_tasks;
+	struct list_head cfs_idle_tasks;

 	struct sched_avg avg_rt;
 	struct sched_avg avg_dl;
--
2.27.0