From: Abel Wu <wuyun.abel@bytedance.com>
To: Peter Zijlstra, Mel Gorman, Vincent Guittot
Cc: joshdon@google.com, linux-kernel@vger.kernel.org, Abel Wu
Subject: [RFC v2 1/2] sched/fair: filter out overloaded cpus in SIS
Date: Sat, 9 Apr 2022 21:51:03 +0800
Message-Id: <20220409135104.3733193-2-wuyun.abel@bytedance.com>
In-Reply-To: <20220409135104.3733193-1-wuyun.abel@bytedance.com>
References: <20220409135104.3733193-1-wuyun.abel@bytedance.com>

It would be beneficial if the unoccupied cpus (sched-idle/idle cpus)
could start serving non-idle tasks as soon as such tasks become
available.

A lot of effort has already been put into this, and the task wakeup
path is one example: when a task is woken up, the scheduler tends to
place it on an unoccupied cpu to make full use of cpu capacity. But
due to scalability concerns, the search depth is bounded to a
reasonable limit; IOW it is possible for a task to be woken up on a
busy cpu while unoccupied cpus are still out there.

This patch focuses on improving SIS search efficiency by filtering
out the overloaded cpus, so the more overloaded the system is, the
fewer cpus we will search.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
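Illustration, not part of the patch: a minimal user-space C sketch of
the filtering idea described above. A shared counter and bitmask track
the overloaded cpus of the LLC; the wakeup scan prunes those cpus from
its candidate set and bails out entirely once more than
span_weight - span_weight/16 of them are overloaded. Every name below
(llc_shared, pick_unoccupied_cpu, ...) is made up for the example and
is not the kernel implementation.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for sched_domain_shared: a counter plus a bitmask. */
struct llc_shared {
	atomic_int nr_overloaded;
	uint64_t   overloaded_mask;	/* bit i set => cpu i runs >1 non-idle task */
};

/* Bail-out threshold: give up once more than ~15/16 of the LLC is overloaded. */
static bool llc_overloaded(int span_weight, int nr_overloaded)
{
	return nr_overloaded > span_weight - (span_weight >> 4);
}

/*
 * Pruned scan: return an unoccupied cpu allowed by @allowed_mask, or -1.
 * Assumes span_weight <= 64 so a single 64-bit mask covers the LLC.
 */
static int pick_unoccupied_cpu(struct llc_shared *s, uint64_t allowed_mask,
			       int span_weight, bool (*cpu_is_unoccupied)(int))
{
	int nro = atomic_load(&s->nr_overloaded);

	if (llc_overloaded(span_weight, nro))
		return -1;				/* scanning would almost surely fail */

	if (nro)
		allowed_mask &= ~s->overloaded_mask;	/* never visit overloaded cpus */

	for (int cpu = 0; cpu < span_weight; cpu++)
		if ((allowed_mask & (1ULL << cpu)) && cpu_is_unoccupied(cpu))
			return cpu;

	return -1;
}

With a 64-cpu LLC, for instance, the scan is skipped outright once more
than 60 cpus are overloaded; below that point the overloaded cpus are
simply never visited.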
 include/linux/sched/topology.h | 12 ++++++++
 kernel/sched/core.c            |  1 +
 kernel/sched/fair.c            | 65 ++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h           |  6 ++++
 kernel/sched/topology.c        |  4 ++-
 5 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 56cffe42abbc..fb35a1983568 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -81,6 +81,18 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+
+	/*
+	 * The state of overloaded cpus is for different use against
+	 * the above elements and they are all hot, so start a new
+	 * cacheline to avoid false sharing.
+	 */
+	atomic_t	nr_overloaded ____cacheline_aligned;
+
+	/*
+	 * Must be last
+	 */
+	unsigned long	overloaded[];
 };
 
 struct sched_domain {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ef946123e9af..a372881f8eaf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9495,6 +9495,7 @@ void __init sched_init(void)
 		rq->wake_stamp = jiffies;
 		rq->wake_avg_idle = rq->avg_idle;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+		rq->overloaded = 0;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 16874e112fe6..fbeb05321615 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6284,6 +6284,15 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 #endif /* CONFIG_SCHED_SMT */
 
 /*
+ * It would be very unlikely to find an unoccupied cpu when system is heavily
+ * overloaded. Even if we could, the cost might bury the benefit.
+ */
+static inline bool sched_domain_overloaded(struct sched_domain *sd, int nr_overloaded)
+{
+	return nr_overloaded > sd->span_weight - (sd->span_weight >> 4);
+}
+
+/*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
  * average idle time for this rq (as found in rq->avg_idle).
@@ -6291,7 +6300,7 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int i, cpu, idle_cpu = -1, nr = INT_MAX;
+	int i, cpu, idle_cpu = -1, nr = INT_MAX, nro;
 	struct rq *this_rq = this_rq();
 	int this = smp_processor_id();
 	struct sched_domain *this_sd;
@@ -6301,7 +6310,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	if (!this_sd)
 		return -1;
 
+	nro = atomic_read(&sd->shared->nr_overloaded);
+	if (sched_domain_overloaded(sd, nro))
+		return -1;
+
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+	if (nro)
+		cpumask_andnot(cpus, cpus, sdo_mask(sd->shared));
 
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
 		u64 avg_cost, avg_idle, span_avg;
@@ -7018,6 +7033,51 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	return newidle_balance(rq, rf) != 0;
 }
+
+static inline bool cfs_rq_overloaded(struct rq *rq)
+{
+	return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running > 1;
+}
+
+/*
+ * Use locality-friendly rq->overloaded to cache the status of the rq
+ * to minimize the heavy cost on LLC shared data.
+ *
+ * Must be called with rq locked
+ */
+static void update_overload_status(struct rq *rq)
+{
+	struct sched_domain_shared *sds;
+	bool overloaded = cfs_rq_overloaded(rq);
+	int cpu = cpu_of(rq);
+
+	lockdep_assert_rq_held(rq);
+
+	if (rq->overloaded == overloaded)
+		return;
+
+	rcu_read_lock();
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (unlikely(!sds))
+		goto unlock;
+
+	if (overloaded) {
+		cpumask_set_cpu(cpu, sdo_mask(sds));
+		atomic_inc(&sds->nr_overloaded);
+	} else {
+		cpumask_clear_cpu(cpu, sdo_mask(sds));
+		atomic_dec(&sds->nr_overloaded);
+	}
+
+	rq->overloaded = overloaded;
+unlock:
+	rcu_read_unlock();
+}
+
+#else
+
+static inline void update_overload_status(struct rq *rq) { }
+
 #endif /* CONFIG_SMP */
 
 static unsigned long wakeup_gran(struct sched_entity *se)
@@ -7365,6 +7425,8 @@ done: __maybe_unused;
 	if (new_tasks > 0)
 		goto again;
 
+	update_overload_status(rq);
+
 	/*
 	 * rq is about to be idle, check if we need to update the
 	 * lost_idle_time of clock_pelt
@@ -11183,6 +11245,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
 
+	update_overload_status(rq);
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3da5718cd641..afa1bb68c3ec 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1012,6 +1012,7 @@ struct rq {
 
 	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;
+	unsigned char		overloaded;
 
 	unsigned long		misfit_task_load;
 
@@ -1764,6 +1765,11 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
 	return sd;
 }
 
+static inline struct cpumask *sdo_mask(struct sched_domain_shared *sds)
+{
+	return to_cpumask(sds->overloaded);
+}
+
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 32841c6741d1..fea1294ebd16 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1621,6 +1621,8 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+		atomic_set(&sd->shared->nr_overloaded, 0);
+		cpumask_clear(sdo_mask(sd->shared));
 	}
 
 	sd->private = sdd;
@@ -2086,7 +2088,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
 			*per_cpu_ptr(sdd->sd, j) = sd;
 
-			sds = kzalloc_node(sizeof(struct sched_domain_shared),
+			sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
 			if (!sds)
 				return -ENOMEM;
-- 
2.11.0

From: Abel Wu <wuyun.abel@bytedance.com>
To: Peter Zijlstra, Mel Gorman, Vincent Guittot
Cc: joshdon@google.com, linux-kernel@vger.kernel.org, Abel Wu
Subject: [RFC v2 2/2] sched/fair: introduce sched-idle balance
Date: Sat, 9 Apr 2022 21:51:04 +0800
Message-Id: <20220409135104.3733193-3-wuyun.abel@bytedance.com>
In-Reply-To: <20220409135104.3733193-1-wuyun.abel@bytedance.com>
References: <20220409135104.3733193-1-wuyun.abel@bytedance.com>

Periodic (normal/idle) balancing is regulated by per-sched-domain
intervals, and those intervals can prevent the unoccupied cpus from
pulling non-idle tasks. Newly-idle balancing, on the other hand, is
triggered only when a cpu becomes truly idle, which is sadly not the
case for sched-idle cpus. There are also other constraints that get
in the way of making the unoccupied cpus busier.

Given the above, sched-idle balancing is an extension to the existing
load-balance mechanisms that lets the unoccupied cpus quickly pull
non-idle tasks from the overloaded cpus. This is achieved by:

  - Quitting early in periodic load balancing once the cpu is no
    longer idle. This is similar to the newly-idle case, in which we
    stop balancing once we have some work to do (although that is
    partly because newly-idle balancing can be very frequent, while
    periodic balancing is not).

  - Making newly-idle balancing try harder to pull non-idle tasks
    when overloaded cpus exist.

In this way we fill the unoccupied cpus more proactively, providing
more cpu capacity for the non-idle tasks.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
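Illustration, not part of the patch: a rough user-space C sketch of the
pull loop this patch adds. The kernel version in the diff below uses the
real rq/lb_env machinery (detach_one_task(), attach_one_task(),
LBF_DST_PINNED); every name here is made up for the example.

#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 64

struct task;				/* opaque in this sketch */

/* Callbacks standing in for the kernel's rq/lb_env machinery. */
struct pull_ops {
	bool (*cpu_overloaded)(int cpu);			/* >1 non-idle task queued? */
	struct task *(*detach_one_nonidle)(int src_cpu);	/* steal one non-idle task */
	void (*attach)(int dst_cpu, struct task *p);
};

static atomic_bool pulling[NR_CPUS];	/* at most one puller per source cpu */

/*
 * An unoccupied cpu walks the overloaded cpus of its LLC, starting just
 * after itself, and pulls a single non-idle task; returns true on success.
 * This mirrors the shape of sched_idle_balance() in the patch below.
 */
static bool sched_idle_pull(int dst_cpu, const struct pull_ops *ops)
{
	for (int i = 1; i < NR_CPUS; i++) {
		int src = (dst_cpu + i) % NR_CPUS;

		if (!ops->cpu_overloaded(src))
			continue;
		if (atomic_exchange(&pulling[src], true))
			continue;		/* another cpu is already pulling here */

		struct task *p = ops->detach_one_nonidle(src);

		atomic_store(&pulling[src], false);
		if (p) {
			ops->attach(dst_cpu, p);
			return true;
		}
	}
	return false;
}

The real code additionally skips hierarchically idle tasks via the
CPU_SCHED_IDLE check in can_migrate_task(), pins the destination cpu
with LBF_DST_PINNED, and refreshes the overload bookkeeping after a
successful detach.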
 include/linux/sched/idle.h |   1 +
 kernel/sched/core.c        |   1 +
 kernel/sched/fair.c        | 145 +++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h       |   2 +
 4 files changed, 142 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
index d73d314d59c6..50ec5c770f85 100644
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -8,6 +8,7 @@ enum cpu_idle_type {
 	CPU_IDLE,
 	CPU_NOT_IDLE,
 	CPU_NEWLY_IDLE,
+	CPU_SCHED_IDLE,
 	CPU_MAX_IDLE_TYPES
 };
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a372881f8eaf..c05c39541c4e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9495,6 +9495,7 @@ void __init sched_init(void)
 		rq->wake_stamp = jiffies;
 		rq->wake_avg_idle = rq->avg_idle;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+		rq->sched_idle_balance = 0;
 		rq->overloaded = 0;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fbeb05321615..5fca3bb98273 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -456,6 +456,21 @@ static int se_is_idle(struct sched_entity *se)
 	return cfs_rq_is_idle(group_cfs_rq(se));
 }
 
+/* Is this an idle task */
+static int task_h_idle(struct task_struct *p)
+{
+	struct sched_entity *se = &p->se;
+
+	if (task_has_idle_policy(p))
+		return 1;
+
+	for_each_sched_entity(se)
+		if (cfs_rq_is_idle(cfs_rq_of(se)))
+			return 1;
+
+	return 0;
+}
+
 #else /* !CONFIG_FAIR_GROUP_SCHED */
 
 #define for_each_sched_entity(se) \
@@ -508,6 +523,11 @@ static int se_is_idle(struct sched_entity *se)
 	return 0;
 }
 
+static inline int task_h_idle(struct task_struct *p)
+{
+	return task_has_idle_policy(p);
+}
+
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 static __always_inline
@@ -7039,6 +7059,16 @@ static inline bool cfs_rq_overloaded(struct rq *rq)
 	return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running > 1;
 }
 
+static inline bool cfs_rq_busy(struct rq *rq)
+{
+	return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running == 1;
+}
+
+static inline bool need_pull_cfs_task(struct rq *rq)
+{
+	return rq->cfs.h_nr_running == rq->cfs.idle_h_nr_running;
+}
+
 /*
  * Use locality-friendly rq->overloaded to cache the status of the rq
  * to minimize the heavy cost on LLC shared data.
@@ -7837,6 +7867,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (kthread_is_per_cpu(p))
 		return 0;
 
+	if (unlikely(task_h_idle(p))) {
+		/*
+		 * Disregard hierarchically idle tasks during sched-idle
+		 * load balancing.
+		 */
+		if (env->idle == CPU_SCHED_IDLE)
+			return 0;
+	} else if (!static_branch_unlikely(&sched_asym_cpucapacity)) {
+		/*
+		 * It's not gonna help if stacking non-idle tasks on one
+		 * cpu while leaving some idle.
+		 */
+		if (cfs_rq_busy(env->src_rq) && !need_pull_cfs_task(env->dst_rq))
+			return 0;
+	}
+
 	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
 		int cpu;
 
@@ -10337,6 +10383,68 @@ static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
 }
 
 /*
+ * The sched-idle balancing tries to make full use of cpu capacity
+ * for non-idle tasks by pulling them for the unoccupied cpus from
+ * the overloaded ones.
+ *
+ * Return 1 if pulled successfully, 0 otherwise.
+ */
+static int sched_idle_balance(struct rq *dst_rq)
+{
+	struct sched_domain *sd;
+	struct task_struct *p = NULL;
+	int dst_cpu = cpu_of(dst_rq), cpu;
+
+	sd = rcu_dereference(per_cpu(sd_llc, dst_cpu));
+	if (unlikely(!sd))
+		return 0;
+
+	if (!atomic_read(&sd->shared->nr_overloaded))
+		return 0;
+
+	for_each_cpu_wrap(cpu, sdo_mask(sd->shared), dst_cpu + 1) {
+		struct rq *rq = cpu_rq(cpu);
+		struct rq_flags rf;
+		struct lb_env env;
+
+		if (cpu == dst_cpu || !cfs_rq_overloaded(rq) ||
+		    READ_ONCE(rq->sched_idle_balance))
+			continue;
+
+		WRITE_ONCE(rq->sched_idle_balance, 1);
+		rq_lock_irqsave(rq, &rf);
+
+		env = (struct lb_env) {
+			.sd		= sd,
+			.dst_cpu	= dst_cpu,
+			.dst_rq		= dst_rq,
+			.src_cpu	= cpu,
+			.src_rq		= rq,
+			.idle		= CPU_SCHED_IDLE,	/* non-idle only */
+			.flags		= LBF_DST_PINNED,	/* pin dst_cpu */
+		};
+
+		update_rq_clock(rq);
+		p = detach_one_task(&env);
+		if (p)
+			update_overload_status(rq);
+
+		rq_unlock(rq, &rf);
+		WRITE_ONCE(rq->sched_idle_balance, 0);
+
+		if (p) {
+			attach_one_task(dst_rq, p);
+			local_irq_restore(rf.flags);
+			return 1;
+		}
+
+		local_irq_restore(rf.flags);
+	}
+
+	return 0;
+}
+
+/*
  * It checks each scheduling domain to see if it is due to be balanced,
  * and initiates a balancing operation if so.
  *
@@ -10356,6 +10464,15 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 	u64 max_cost = 0;
 
 	rcu_read_lock();
+
+	/*
+	 * Quit early if this cpu is no idle any more. It might not be a
+	 * problem since we have already made some contribution to fix
+	 * imbalance.
+	 */
+	if (need_pull_cfs_task(rq) && sched_idle_balance(rq))
+		continue_balancing = 0;
+
 	for_each_domain(cpu, sd) {
 		/*
 		 * Decay the newidle max times here because this is a regular
@@ -10934,7 +11051,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	int this_cpu = this_rq->cpu;
 	u64 t0, t1, curr_cost = 0;
 	struct sched_domain *sd;
-	int pulled_task = 0;
+	struct sched_domain_shared *sds;
+	int pulled_task = 0, has_overloaded_cpus = 0;
 
 	update_misfit_status(NULL, this_rq);
 
@@ -10985,6 +11103,11 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	update_blocked_averages(this_cpu);
 
 	rcu_read_lock();
+
+	sds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
+	if (likely(sds))
+		has_overloaded_cpus = atomic_read(&sds->nr_overloaded);
+
 	for_each_domain(this_cpu, sd) {
 		int continue_balancing = 1;
 		u64 domain_cost;
@@ -10996,9 +11119,9 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
 
-			pulled_task = load_balance(this_cpu, this_rq,
-						   sd, CPU_NEWLY_IDLE,
-						   &continue_balancing);
+			pulled_task |= load_balance(this_cpu, this_rq,
+						    sd, CPU_NEWLY_IDLE,
+						    &continue_balancing);
 
 			t1 = sched_clock_cpu(this_cpu);
 			domain_cost = t1 - t0;
@@ -11006,13 +11129,21 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 			curr_cost += domain_cost;
 			t0 = t1;
+
+			/*
+			 * Stop searching for tasks to pull if there are
+			 * now runnable tasks on this rq, given that no
+			 * overloaded cpu can be found on this LLC.
+			 */
+			if (pulled_task && !has_overloaded_cpus)
+				break;
 		}
 
 		/*
-		 * Stop searching for tasks to pull if there are
-		 * now runnable tasks on this rq.
+		 * Try harder to pull non-idle tasks to let them use as more
+		 * cpu capacity as it can be.
 		 */
-		if (pulled_task || this_rq->nr_running > 0 ||
+		if (this_rq->nr_running > this_rq->cfs.idle_h_nr_running ||
 		    this_rq->ttwu_pending)
 			break;
 	}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index afa1bb68c3ec..dcceaec8d8b4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1012,6 +1012,8 @@ struct rq {
 
 	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;
+
+	unsigned char		sched_idle_balance;
 	unsigned char		overloaded;
 
 	unsigned long		misfit_task_load;
-- 
2.11.0