From: Abel Wu <wuyun.abel@bytedance.com>
To: Peter Zijlstra, Mel Gorman, Vincent Guittot
Cc: joshdon@google.com, linux-kernel@vger.kernel.org, Abel Wu
Subject: [RFC v2 1/2] sched/fair: filter out overloaded cpus in SIS
Date: Sat, 9 Apr 2022 21:51:03 +0800
Message-Id: <20220409135104.3733193-2-wuyun.abel@bytedance.com>
In-Reply-To: <20220409135104.3733193-1-wuyun.abel@bytedance.com>
References: <20220409135104.3733193-1-wuyun.abel@bytedance.com>

It would be beneficial if the unoccupied cpus (sched-idle/idle cpus)
could start serving non-idle tasks as soon as such tasks become
available.

A lot of effort has already been put into this, and the task wakeup
path is one example: when a task is woken up, the scheduler tends to
place it on an unoccupied cpu to make full use of cpu capacity. But
due to scalability concerns, the search depth is bounded to a
reasonable limit; IOW it is possible for a task to be woken up on a
busy cpu while unoccupied cpus are still out there.

This patch focuses on improving SIS search efficiency by filtering
out the overloaded cpus, so the more overloaded the system is, the
fewer cpus we will search.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
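Illustration, not part of the patch: a minimal user-space C sketch of
the filtering idea described above. A shared counter and bitmask track
the overloaded cpus of the LLC; the wakeup scan prunes those cpus from
its candidate set and bails out entirely once more than
span_weight - span_weight/16 of them are overloaded. Every name below
(llc_shared, pick_unoccupied_cpu, ...) is made up for the example and
is not the kernel implementation.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for sched_domain_shared: a counter plus a bitmask. */
struct llc_shared {
	atomic_int nr_overloaded;
	uint64_t   overloaded_mask;	/* bit i set => cpu i runs >1 non-idle task */
};

/* Bail-out threshold: give up once more than ~15/16 of the LLC is overloaded. */
static bool llc_overloaded(int span_weight, int nr_overloaded)
{
	return nr_overloaded > span_weight - (span_weight >> 4);
}

/*
 * Pruned scan: return an unoccupied cpu allowed by @allowed_mask, or -1.
 * Assumes span_weight <= 64 so a single 64-bit mask covers the LLC.
 */
static int pick_unoccupied_cpu(struct llc_shared *s, uint64_t allowed_mask,
			       int span_weight, bool (*cpu_is_unoccupied)(int))
{
	int nro = atomic_load(&s->nr_overloaded);

	if (llc_overloaded(span_weight, nro))
		return -1;				/* scanning would almost surely fail */

	if (nro)
		allowed_mask &= ~s->overloaded_mask;	/* never visit overloaded cpus */

	for (int cpu = 0; cpu < span_weight; cpu++)
		if ((allowed_mask & (1ULL << cpu)) && cpu_is_unoccupied(cpu))
			return cpu;

	return -1;
}

With a 64-cpu LLC, for instance, the scan is skipped outright once more
than 60 cpus are overloaded; below that point the overloaded cpus are
simply never visited.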
 include/linux/sched/topology.h | 12 ++++++++
 kernel/sched/core.c            |  1 +
 kernel/sched/fair.c            | 65 ++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h           |  6 ++++
 kernel/sched/topology.c        |  4 ++-
 5 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 56cffe42abbc..fb35a1983568 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -81,6 +81,18 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+
+	/*
+	 * The state of overloaded cpus is for different use against
+	 * the above elements and they are all hot, so start a new
+	 * cacheline to avoid false sharing.
+	 */
+	atomic_t	nr_overloaded ____cacheline_aligned;
+
+	/*
+	 * Must be last
+	 */
+	unsigned long	overloaded[];
 };
 
 struct sched_domain {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ef946123e9af..a372881f8eaf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9495,6 +9495,7 @@ void __init sched_init(void)
 		rq->wake_stamp = jiffies;
 		rq->wake_avg_idle = rq->avg_idle;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+		rq->overloaded = 0;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 16874e112fe6..fbeb05321615 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6284,6 +6284,15 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 #endif /* CONFIG_SCHED_SMT */
 
 /*
+ * It would be very unlikely to find an unoccupied cpu when system is heavily
+ * overloaded. Even if we could, the cost might bury the benefit.
+ */
+static inline bool sched_domain_overloaded(struct sched_domain *sd, int nr_overloaded)
+{
+	return nr_overloaded > sd->span_weight - (sd->span_weight >> 4);
+}
+
+/*
  * Scan the LLC domain for idle CPUs; this is dynamically regulated by
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
  * average idle time for this rq (as found in rq->avg_idle).
@@ -6291,7 +6300,7 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int i, cpu, idle_cpu = -1, nr = INT_MAX;
+	int i, cpu, idle_cpu = -1, nr = INT_MAX, nro;
 	struct rq *this_rq = this_rq();
 	int this = smp_processor_id();
 	struct sched_domain *this_sd;
@@ -6301,7 +6310,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	if (!this_sd)
 		return -1;
 
+	nro = atomic_read(&sd->shared->nr_overloaded);
+	if (sched_domain_overloaded(sd, nro))
+		return -1;
+
 	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+	if (nro)
+		cpumask_andnot(cpus, cpus, sdo_mask(sd->shared));
 
 	if (sched_feat(SIS_PROP) && !has_idle_core) {
 		u64 avg_cost, avg_idle, span_avg;
@@ -7018,6 +7033,51 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	return newidle_balance(rq, rf) != 0;
 }
+
+static inline bool cfs_rq_overloaded(struct rq *rq)
+{
+	return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running > 1;
+}
+
+/*
+ * Use locality-friendly rq->overloaded to cache the status of the rq
+ * to minimize the heavy cost on LLC shared data.
+ *
+ * Must be called with rq locked
+ */
+static void update_overload_status(struct rq *rq)
+{
+	struct sched_domain_shared *sds;
+	bool overloaded = cfs_rq_overloaded(rq);
+	int cpu = cpu_of(rq);
+
+	lockdep_assert_rq_held(rq);
+
+	if (rq->overloaded == overloaded)
+		return;
+
+	rcu_read_lock();
+	sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (unlikely(!sds))
+		goto unlock;
+
+	if (overloaded) {
+		cpumask_set_cpu(cpu, sdo_mask(sds));
+		atomic_inc(&sds->nr_overloaded);
+	} else {
+		cpumask_clear_cpu(cpu, sdo_mask(sds));
+		atomic_dec(&sds->nr_overloaded);
+	}
+
+	rq->overloaded = overloaded;
+unlock:
+	rcu_read_unlock();
+}
+
+#else
+
+static inline void update_overload_status(struct rq *rq) { }
+
 #endif /* CONFIG_SMP */
 
 static unsigned long wakeup_gran(struct sched_entity *se)
@@ -7365,6 +7425,8 @@ done: __maybe_unused;
 	if (new_tasks > 0)
 		goto again;
 
+	update_overload_status(rq);
+
 	/*
 	 * rq is about to be idle, check if we need to update the
 	 * lost_idle_time of clock_pelt
@@ -11183,6 +11245,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
 
+	update_overload_status(rq);
 	update_misfit_status(curr, rq);
 	update_overutilized_status(task_rq(curr));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3da5718cd641..afa1bb68c3ec 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1012,6 +1012,7 @@ struct rq {
 
 	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;
+	unsigned char		overloaded;
 
 	unsigned long		misfit_task_load;
 
@@ -1764,6 +1765,11 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
 	return sd;
 }
 
+static inline struct cpumask *sdo_mask(struct sched_domain_shared *sds)
+{
+	return to_cpumask(sds->overloaded);
+}
+
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 32841c6741d1..fea1294ebd16 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1621,6 +1621,8 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
+		atomic_set(&sd->shared->nr_overloaded, 0);
+		cpumask_clear(sdo_mask(sd->shared));
 	}
 
 	sd->private = sdd;
@@ -2086,7 +2088,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
 			*per_cpu_ptr(sdd->sd, j) = sd;
 
-			sds = kzalloc_node(sizeof(struct sched_domain_shared),
+			sds = kzalloc_node(sizeof(struct sched_domain_shared) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
 			if (!sds)
 				return -ENOMEM;
-- 
2.11.0

From: Abel Wu <wuyun.abel@bytedance.com>
To: Peter Zijlstra, Mel Gorman, Vincent Guittot
Cc: joshdon@google.com, linux-kernel@vger.kernel.org, Abel Wu
Subject: [RFC v2 2/2] sched/fair: introduce sched-idle balance
Date: Sat, 9 Apr 2022 21:51:04 +0800
Message-Id: <20220409135104.3733193-3-wuyun.abel@bytedance.com>
In-Reply-To: <20220409135104.3733193-1-wuyun.abel@bytedance.com>
References: <20220409135104.3733193-1-wuyun.abel@bytedance.com>

Periodic (normal/idle) balancing is regulated by per-sched-domain
intervals, and those intervals can prevent the unoccupied cpus from
pulling non-idle tasks. Newly-idle balancing, on the other hand, is
triggered only when a cpu becomes truly idle, which is sadly not the
case for sched-idle cpus. There are also other constraints that get
in the way of making the unoccupied cpus busier.

Given the above, sched-idle balancing is an extension to the existing
load-balance mechanisms that lets the unoccupied cpus quickly pull
non-idle tasks from the overloaded cpus. This is achieved by:

  - Quitting early in periodic load balancing once the cpu is no
    longer idle. This is similar to the newly-idle case, in which we
    stop balancing once we have some work to do (although that is
    partly because newly-idle balancing can be very frequent, while
    periodic balancing is not).

  - Making newly-idle balancing try harder to pull non-idle tasks
    when overloaded cpus exist.

In this way we fill the unoccupied cpus more proactively, providing
more cpu capacity for the non-idle tasks.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
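Illustration, not part of the patch: a rough user-space C sketch of the
pull loop this patch adds. The kernel version in the diff below uses the
real rq/lb_env machinery (detach_one_task(), attach_one_task(),
LBF_DST_PINNED); every name here is made up for the example.

#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 64

struct task;				/* opaque in this sketch */

/* Callbacks standing in for the kernel's rq/lb_env machinery. */
struct pull_ops {
	bool (*cpu_overloaded)(int cpu);			/* >1 non-idle task queued? */
	struct task *(*detach_one_nonidle)(int src_cpu);	/* steal one non-idle task */
	void (*attach)(int dst_cpu, struct task *p);
};

static atomic_bool pulling[NR_CPUS];	/* at most one puller per source cpu */

/*
 * An unoccupied cpu walks the overloaded cpus of its LLC, starting just
 * after itself, and pulls a single non-idle task; returns true on success.
 * This mirrors the shape of sched_idle_balance() in the patch below.
 */
static bool sched_idle_pull(int dst_cpu, const struct pull_ops *ops)
{
	for (int i = 1; i < NR_CPUS; i++) {
		int src = (dst_cpu + i) % NR_CPUS;

		if (!ops->cpu_overloaded(src))
			continue;
		if (atomic_exchange(&pulling[src], true))
			continue;		/* another cpu is already pulling here */

		struct task *p = ops->detach_one_nonidle(src);

		atomic_store(&pulling[src], false);
		if (p) {
			ops->attach(dst_cpu, p);
			return true;
		}
	}
	return false;
}

The real code additionally skips hierarchically idle tasks via the
CPU_SCHED_IDLE check in can_migrate_task(), pins the destination cpu
with LBF_DST_PINNED, and refreshes the overload bookkeeping after a
successful detach.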
 include/linux/sched/idle.h |   1 +
 kernel/sched/core.c        |   1 +
 kernel/sched/fair.c        | 145 +++++++++++++++++++++++++++++++++++++++++---
 kernel/sched/sched.h       |   2 +
 4 files changed, 142 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
index d73d314d59c6..50ec5c770f85 100644
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -8,6 +8,7 @@ enum cpu_idle_type {
 	CPU_IDLE,
 	CPU_NOT_IDLE,
 	CPU_NEWLY_IDLE,
+	CPU_SCHED_IDLE,
 	CPU_MAX_IDLE_TYPES
 };
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a372881f8eaf..c05c39541c4e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9495,6 +9495,7 @@ void __init sched_init(void)
 		rq->wake_stamp = jiffies;
 		rq->wake_avg_idle = rq->avg_idle;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
+		rq->sched_idle_balance = 0;
 		rq->overloaded = 0;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fbeb05321615..5fca3bb98273 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -456,6 +456,21 @@ static int se_is_idle(struct sched_entity *se)
 	return cfs_rq_is_idle(group_cfs_rq(se));
 }
 
+/* Is this an idle task */
+static int task_h_idle(struct task_struct *p)
+{
+	struct sched_entity *se = &p->se;
+
+	if (task_has_idle_policy(p))
+		return 1;
+
+	for_each_sched_entity(se)
+		if (cfs_rq_is_idle(cfs_rq_of(se)))
+			return 1;
+
+	return 0;
+}
+
 #else /* !CONFIG_FAIR_GROUP_SCHED */
 
 #define for_each_sched_entity(se) \
@@ -508,6 +523,11 @@ static int se_is_idle(struct sched_entity *se)
 	return 0;
 }
 
+static inline int task_h_idle(struct task_struct *p)
+{
+	return task_has_idle_policy(p);
+}
+
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 static __always_inline
@@ -7039,6 +7059,16 @@ static inline bool cfs_rq_overloaded(struct rq *rq)
 	return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running > 1;
 }
 
+static inline bool cfs_rq_busy(struct rq *rq)
+{
+	return rq->cfs.h_nr_running - rq->cfs.idle_h_nr_running == 1;
+}
+
+static inline bool need_pull_cfs_task(struct rq *rq)
+{
+	return rq->cfs.h_nr_running == rq->cfs.idle_h_nr_running;
+}
+
 /*
  * Use locality-friendly rq->overloaded to cache the status of the rq
  * to minimize the heavy cost on LLC shared data.
@@ -7837,6 +7867,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (kthread_is_per_cpu(p))
 		return 0;
 
+	if (unlikely(task_h_idle(p))) {
+		/*
+		 * Disregard hierarchically idle tasks during sched-idle
+		 * load balancing.
+		 */
+		if (env->idle == CPU_SCHED_IDLE)
+			return 0;
+	} else if (!static_branch_unlikely(&sched_asym_cpucapacity)) {
+		/*
+		 * It's not gonna help if stacking non-idle tasks on one
+		 * cpu while leaving some idle.
+		 */
+		if (cfs_rq_busy(env->src_rq) && !need_pull_cfs_task(env->dst_rq))
+			return 0;
+	}
+
 	if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
 		int cpu;
 
@@ -10337,6 +10383,68 @@ static inline bool update_newidle_cost(struct sched_domain *sd, u64 cost)
 }
 
 /*
+ * The sched-idle balancing tries to make full use of cpu capacity
+ * for non-idle tasks by pulling them for the unoccupied cpus from
+ * the overloaded ones.
+ *
+ * Return 1 if pulled successfully, 0 otherwise.
+ */
+static int sched_idle_balance(struct rq *dst_rq)
+{
+	struct sched_domain *sd;
+	struct task_struct *p = NULL;
+	int dst_cpu = cpu_of(dst_rq), cpu;
+
+	sd = rcu_dereference(per_cpu(sd_llc, dst_cpu));
+	if (unlikely(!sd))
+		return 0;
+
+	if (!atomic_read(&sd->shared->nr_overloaded))
+		return 0;
+
+	for_each_cpu_wrap(cpu, sdo_mask(sd->shared), dst_cpu + 1) {
+		struct rq *rq = cpu_rq(cpu);
+		struct rq_flags rf;
+		struct lb_env env;
+
+		if (cpu == dst_cpu || !cfs_rq_overloaded(rq) ||
+		    READ_ONCE(rq->sched_idle_balance))
+			continue;
+
+		WRITE_ONCE(rq->sched_idle_balance, 1);
+		rq_lock_irqsave(rq, &rf);
+
+		env = (struct lb_env) {
+			.sd		= sd,
+			.dst_cpu	= dst_cpu,
+			.dst_rq		= dst_rq,
+			.src_cpu	= cpu,
+			.src_rq		= rq,
+			.idle		= CPU_SCHED_IDLE,	/* non-idle only */
+			.flags		= LBF_DST_PINNED,	/* pin dst_cpu */
+		};
+
+		update_rq_clock(rq);
+		p = detach_one_task(&env);
+		if (p)
+			update_overload_status(rq);
+
+		rq_unlock(rq, &rf);
+		WRITE_ONCE(rq->sched_idle_balance, 0);
+
+		if (p) {
+			attach_one_task(dst_rq, p);
+			local_irq_restore(rf.flags);
+			return 1;
+		}
+
+		local_irq_restore(rf.flags);
+	}
+
+	return 0;
+}
+
+/*
  * It checks each scheduling domain to see if it is due to be balanced,
  * and initiates a balancing operation if so.
  *
@@ -10356,6 +10464,15 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 	u64 max_cost = 0;
 
 	rcu_read_lock();
+
+	/*
+	 * Quit early if this cpu is no idle any more. It might not be a
+	 * problem since we have already made some contribution to fix
+	 * imbalance.
+	 */
+	if (need_pull_cfs_task(rq) && sched_idle_balance(rq))
+		continue_balancing = 0;
+
 	for_each_domain(cpu, sd) {
 		/*
 		 * Decay the newidle max times here because this is a regular
@@ -10934,7 +11051,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	int this_cpu = this_rq->cpu;
 	u64 t0, t1, curr_cost = 0;
 	struct sched_domain *sd;
-	int pulled_task = 0;
+	struct sched_domain_shared *sds;
+	int pulled_task = 0, has_overloaded_cpus = 0;
 
 	update_misfit_status(NULL, this_rq);
 
@@ -10985,6 +11103,11 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	update_blocked_averages(this_cpu);
 
 	rcu_read_lock();
+
+	sds = rcu_dereference(per_cpu(sd_llc_shared, this_cpu));
+	if (likely(sds))
+		has_overloaded_cpus = atomic_read(&sds->nr_overloaded);
+
 	for_each_domain(this_cpu, sd) {
 		int continue_balancing = 1;
 		u64 domain_cost;
@@ -10996,9 +11119,9 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
 
-			pulled_task = load_balance(this_cpu, this_rq,
-						   sd, CPU_NEWLY_IDLE,
-						   &continue_balancing);
+			pulled_task |= load_balance(this_cpu, this_rq,
+						    sd, CPU_NEWLY_IDLE,
+						    &continue_balancing);
 
 			t1 = sched_clock_cpu(this_cpu);
 			domain_cost = t1 - t0;
@@ -11006,13 +11129,21 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 			curr_cost += domain_cost;
 			t0 = t1;
+
+			/*
+			 * Stop searching for tasks to pull if there are
+			 * now runnable tasks on this rq, given that no
+			 * overloaded cpu can be found on this LLC.
+			 */
+			if (pulled_task && !has_overloaded_cpus)
+				break;
 		}
 
 		/*
-		 * Stop searching for tasks to pull if there are
-		 * now runnable tasks on this rq.
+		 * Try harder to pull non-idle tasks to let them use as more
+		 * cpu capacity as it can be.
 		 */
-		if (pulled_task || this_rq->nr_running > 0 ||
+		if (this_rq->nr_running > this_rq->cfs.idle_h_nr_running ||
 		    this_rq->ttwu_pending)
 			break;
 	}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index afa1bb68c3ec..dcceaec8d8b4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1012,6 +1012,8 @@ struct rq {
 
 	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;
+
+	unsigned char		sched_idle_balance;
 	unsigned char		overloaded;
 
 	unsigned long		misfit_task_load;
-- 
2.11.0