From: Vincent Guittot
To: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
    dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
    mgorman@suse.de, vschneid@redhat.com, lukasz.luba@arm.com,
    rafael.j.wysocki@intel.com, linux-kernel@vger.kernel.org
Cc: qyousef@layalina.io, hongyan.xia2@arm.com, Vincent Guittot
Subject: [PATCH 1/5] sched/fair: Filter false overloaded_group case for EAS
Date: Fri, 30 Aug 2024 15:03:05 +0200
Message-Id: <20240830130309.2141697-2-vincent.guittot@linaro.org>
In-Reply-To: <20240830130309.2141697-1-vincent.guittot@linaro.org>
References: <20240830130309.2141697-1-vincent.guittot@linaro.org>

With EAS, a group should be set overloaded if at least 1 CPU in the group
is overutilized, but it can happen that a CPU is fully utilized by tasks
because the compute capacity of the CPU has been clamped. In such a case,
the CPU is not overutilized and, as a result, should not be set overloaded
either.

group_overloaded having a higher priority than group_misfit, such a group
can be selected as the busiest group instead of a group with a misfit
task, which prevents load_balance() from selecting the CPU with the misfit
task in order to pull the latter onto a fitting CPU.

Signed-off-by: Vincent Guittot
Tested-by: Pierre Gondois
---
 kernel/sched/fair.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea057b311f6..e67d6029b269 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9806,6 +9806,7 @@ struct sg_lb_stats {
 	enum group_type group_type;
 	unsigned int group_asym_packing;	/* Tasks should be moved to preferred CPU */
 	unsigned int group_smt_balance;		/* Task on busy SMT be moved */
+	unsigned long group_overutilized;	/* At least one CPU in the group is overutilized */
 	unsigned long group_misfit_task_load;	/* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
@@ -10039,6 +10040,13 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 static inline bool
 group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
 {
+	/*
+	 * With EAS and uclamp, at least 1 CPU in the group must be
+	 * overutilized to consider the group overloaded.
+	 */
+	if (sched_energy_enabled() && !sgs->group_overutilized)
+		return false;
+
 	if (sgs->sum_nr_running <= sgs->group_weight)
 		return false;
 
@@ -10252,8 +10260,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	if (nr_running > 1)
 		*sg_overloaded = 1;
 
-	if (cpu_overutilized(i))
+	if (cpu_overutilized(i)) {
 		*sg_overutilized = 1;
+		sgs->group_overutilized = 1;
+	}
 
 #ifdef CONFIG_NUMA_BALANCING
 	sgs->nr_numa_running += rq->nr_numa_running;
-- 
2.34.1
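The effect of the new filter is easiest to see in isolation. Below is a
minimal userspace sketch (not kernel code) of the gating added to
group_is_overloaded(); the struct, the eas_enabled flag and the simplified
overload test are illustrative stand-ins for the kernel's sg_lb_stats,
sched_energy_enabled() and the full imbalance check:

#include <stdbool.h>
#include <stdio.h>

struct sg_stats {
	unsigned int sum_nr_running;	/* runnable tasks in the group */
	unsigned int group_weight;	/* number of CPUs in the group */
	bool group_overutilized;	/* at least one CPU overutilized */
};

static bool group_is_overloaded(bool eas_enabled, const struct sg_stats *sgs)
{
	/* New filter: without an overutilized CPU, EAS keeps the group "not overloaded" */
	if (eas_enabled && !sgs->group_overutilized)
		return false;

	/* Simplified form of the pre-existing check: more tasks than CPUs */
	return sgs->sum_nr_running > sgs->group_weight;
}

int main(void)
{
	/* 4 CPUs running 6 tasks, but no CPU overutilized (e.g. capped by uclamp_max) */
	struct sg_stats sgs = { .sum_nr_running = 6, .group_weight = 4,
				.group_overutilized = false };

	printf("EAS on : overloaded=%d\n", group_is_overloaded(true, &sgs));	/* 0 */
	printf("EAS off: overloaded=%d\n", group_is_overloaded(false, &sgs));	/* 1 */
	return 0;
}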
From: Vincent Guittot
To: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
    dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
    mgorman@suse.de, vschneid@redhat.com, lukasz.luba@arm.com,
    rafael.j.wysocki@intel.com, linux-kernel@vger.kernel.org
Cc: qyousef@layalina.io, hongyan.xia2@arm.com, Vincent Guittot
Subject: [PATCH 2/5] energy model: Add a get previous state function
Date: Fri, 30 Aug 2024 15:03:06 +0200
Message-Id: <20240830130309.2141697-3-vincent.guittot@linaro.org>
In-Reply-To: <20240830130309.2141697-1-vincent.guittot@linaro.org>
References: <20240830130309.2141697-1-vincent.guittot@linaro.org>

Instead of parsing the whole EM table every time, add a function to get
the previous performance state. It will be used in the scheduler's feec()
function.
Signed-off-by: Vincent Guittot
---
 include/linux/energy_model.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 1ff52020cf75..ea8ea7e031c0 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -207,6 +207,24 @@ em_pd_get_efficient_state(struct em_perf_state *table, int nr_perf_states,
 	return nr_perf_states - 1;
 }
 
+static inline int
+em_pd_get_previous_state(struct em_perf_state *table, int nr_perf_states,
+			 int idx, unsigned long pd_flags)
+{
+	struct em_perf_state *ps;
+	int i;
+
+	for (i = idx - 1; i >= 0; i--) {
+		ps = &table[i];
+		if (pd_flags & EM_PERF_DOMAIN_SKIP_INEFFICIENCIES &&
+		    ps->flags & EM_PERF_STATE_INEFFICIENT)
+			continue;
+		return i;
+	}
+
+	return -1;
+}
+
 /**
  * em_cpu_energy() - Estimates the energy consumed by the CPUs of a
  * performance domain
-- 
2.34.1
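For readers unfamiliar with the EM table walk, here is a small
self-contained sketch of how such a previous-state lookup behaves: from a
given state index, walk backwards and skip states flagged inefficient. The
types and flag names below are mock-ups, not the kernel's em_perf_state
and EM_PERF_* definitions:

#include <stdio.h>

#define STATE_INEFFICIENT	(1UL << 0)
#define SKIP_INEFFICIENCIES	(1UL << 0)

struct perf_state { unsigned long performance; unsigned long flags; };

static int get_previous_state(const struct perf_state *table, int idx,
			      unsigned long pd_flags)
{
	for (int i = idx - 1; i >= 0; i--) {
		/* Optionally skip states marked inefficient */
		if ((pd_flags & SKIP_INEFFICIENCIES) &&
		    (table[i].flags & STATE_INEFFICIENT))
			continue;
		return i;
	}
	return -1;	/* no lower state: idx was already the lowest usable OPP */
}

int main(void)
{
	struct perf_state table[] = {
		{ .performance = 200 },
		{ .performance = 400, .flags = STATE_INEFFICIENT },
		{ .performance = 600 },
	};

	/* From state 2, the inefficient state 1 is skipped, landing on state 0 */
	printf("prev of 2 -> %d\n", get_previous_state(table, 2, SKIP_INEFFICIENCIES));
	/* From the lowest state there is nothing below */
	printf("prev of 0 -> %d\n", get_previous_state(table, 0, SKIP_INEFFICIENCIES));
	return 0;
}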
From: Vincent Guittot
To: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
    dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
    mgorman@suse.de, vschneid@redhat.com, lukasz.luba@arm.com,
    rafael.j.wysocki@intel.com, linux-kernel@vger.kernel.org
Cc: qyousef@layalina.io, hongyan.xia2@arm.com, Vincent Guittot
Subject: [PATCH 3/5] sched/fair: Rework feec() to use cost instead of spare capacity
Date: Fri, 30 Aug 2024 15:03:07 +0200
Message-Id: <20240830130309.2141697-4-vincent.guittot@linaro.org>
In-Reply-To: <20240830130309.2141697-1-vincent.guittot@linaro.org>
References: <20240830130309.2141697-1-vincent.guittot@linaro.org>

feec() looks for the CPU with the highest spare capacity in a PD, assuming
it will be the best CPU from an energy-efficiency PoV because it will
require the smallest increase of OPP. Although this is generally true,
this policy also filters out other CPUs that would be just as efficient
because they would run at the same OPP. In fact, we really care about the
cost of the new OPP that will be selected to handle the waking task. In
many cases, several CPUs will end up selecting the same OPP and, as a
result, incurring the same energy cost. In these cases, we can use other
metrics to select the best CPU for the same energy cost.

Rework feec() to look first for the lowest-cost OPP in a PD and then for
the most performant CPU among the CPUs sharing that cost.

Signed-off-by: Vincent Guittot
---
 kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
 1 file changed, 244 insertions(+), 222 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e67d6029b269..2273eecf6086 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8081,29 +8081,37 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
 }
 
 /*
- * energy_env - Utilization landscape for energy estimation.
- * @task_busy_time: Utilization contribution by the task for which we test the
- *		    placement. Given by eenv_task_busy_time().
- * @pd_busy_time: Utilization of the whole perf domain without the task
- *		  contribution. Given by eenv_pd_busy_time().
- * @cpu_cap: Maximum CPU capacity for the perf domain.
- * @pd_cap: Entire perf domain capacity. (pd->nr_cpus * cpu_cap).
- */
-struct energy_env {
-	unsigned long task_busy_time;
-	unsigned long pd_busy_time;
-	unsigned long cpu_cap;
-	unsigned long pd_cap;
+ * energy_cpu_stat - Utilization landscape for energy estimation.
+ * @idx : Index of the OPP in the performance domain
+ * @cost : Cost of the OPP
+ * @max_perf : Compute capacity of the OPP
+ * @min_perf : Compute capacity of the previous OPP
+ * @capa : Capacity of the CPU
+ * @runnable : runnable_avg of the CPU
+ * @nr_running : number of cfs running tasks
+ * @fits : Fits level of the CPU
+ * @cpu : current best CPU
+ */
+struct energy_cpu_stat {
+	unsigned long idx;
+	unsigned long cost;
+	unsigned long max_perf;
+	unsigned long min_perf;
+	unsigned long capa;
+	unsigned long util;
+	unsigned long runnable;
+	unsigned int nr_running;
+	int fits;
+	int cpu;
 };
 
 /*
- * Compute the task busy time for compute_energy(). This time cannot be
- * injected directly into effective_cpu_util() because of the IRQ scaling.
+ * Compute the task busy time for computing its energy impact. This time cannot
+ * be injected directly into effective_cpu_util() because of the IRQ scaling.
  * The latter only makes sense with the most recent CPUs where the task has
  * run.
  */
-static inline void eenv_task_busy_time(struct energy_env *eenv,
-				       struct task_struct *p, int prev_cpu)
+static inline unsigned long task_busy_time(struct task_struct *p, int prev_cpu)
 {
 	unsigned long busy_time, max_cap = arch_scale_cpu_capacity(prev_cpu);
 	unsigned long irq = cpu_util_irq(cpu_rq(prev_cpu));
@@ -8113,124 +8121,152 @@ static inline void eenv_task_busy_time(struct energy_env *eenv,
 	else
 		busy_time = scale_irq_capacity(task_util_est(p), irq, max_cap);
 
-	eenv->task_busy_time = busy_time;
+	return busy_time;
 }
 
-/*
- * Compute the perf_domain (PD) busy time for compute_energy(). Based on the
- * utilization for each @pd_cpus, it however doesn't take into account
- * clamping since the ratio (utilization / cpu_capacity) is already enough to
- * scale the EM reported power consumption at the (eventually clamped)
- * cpu_capacity.
- *
- * The contribution of the task @p for which we want to estimate the
- * energy cost is removed (by cpu_util()) and must be calculated
- * separately (see eenv_task_busy_time). This ensures:
- *
- *   - A stable PD utilization, no matter which CPU of that PD we want to place
- *     the task on.
- *
- *   - A fair comparison between CPUs as the task contribution (task_util())
- *     will always be the same no matter which CPU utilization we rely on
- *     (util_avg or util_est).
- *
- * Set @eenv busy time for the PD that spans @pd_cpus. This busy time can't
- * exceed @eenv->pd_cap.
- */
-static inline void eenv_pd_busy_time(struct energy_env *eenv,
-				     struct cpumask *pd_cpus,
-				     struct task_struct *p)
+/* Estimate the utilization of the CPU that is then used to select the OPP */
+static unsigned long find_cpu_max_util(int cpu, struct task_struct *p, int dst_cpu)
 {
-	unsigned long busy_time = 0;
-	int cpu;
+	unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
+	unsigned long eff_util, min, max;
+
+	/*
+	 * Performance domain frequency: utilization clamping
+	 * must be considered since it affects the selection
+	 * of the performance domain frequency.
+	 */
+	eff_util = effective_cpu_util(cpu, util, &min, &max);
 
-	for_each_cpu(cpu, pd_cpus) {
-		unsigned long util = cpu_util(cpu, p, -1, 0);
+	/* Task's uclamp can modify min and max value */
+	if (uclamp_is_used() && cpu == dst_cpu) {
+		min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
 
-		busy_time += effective_cpu_util(cpu, util, NULL, NULL);
+		/*
+		 * If there is no active max uclamp constraint,
+		 * directly use task's one, otherwise keep max.
+		 */
+		if (uclamp_rq_is_idle(cpu_rq(cpu)))
+			max = uclamp_eff_value(p, UCLAMP_MAX);
+		else
+			max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
 	}
 
-	eenv->pd_busy_time = min(eenv->pd_cap, busy_time);
+	eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
+	return eff_util;
 }
 
-/*
- * Compute the maximum utilization for compute_energy() when the task @p
- * is placed on the cpu @dst_cpu.
- *
- * Returns the maximum utilization among @eenv->cpus. This utilization can't
- * exceed @eenv->cpu_cap.
- */
-static inline unsigned long
-eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
-		 struct task_struct *p, int dst_cpu)
+/* Estimate the utilization of the CPU without the task */
+static unsigned long find_cpu_actual_util(int cpu, struct task_struct *p)
 {
-	unsigned long max_util = 0;
-	int cpu;
+	unsigned long util = cpu_util(cpu, p, -1, 0);
+	unsigned long eff_util;
 
-	for_each_cpu(cpu, pd_cpus) {
-		struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
-		unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
-		unsigned long eff_util, min, max;
+	eff_util = effective_cpu_util(cpu, util, NULL, NULL);
 
-		/*
-		 * Performance domain frequency: utilization clamping
-		 * must be considered since it affects the selection
-		 * of the performance domain frequency.
-		 * NOTE: in case RT tasks are running, by default the min
-		 * utilization can be max OPP.
-		 */
-		eff_util = effective_cpu_util(cpu, util, &min, &max);
+	return eff_util;
+}
 
-		/* Task's uclamp can modify min and max value */
-		if (tsk && uclamp_is_used()) {
-			min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
+/* Find the cost of a performance domain for the estimated utilization */
+static inline void find_pd_cost(struct em_perf_domain *pd,
+				unsigned long max_util,
+				struct energy_cpu_stat *stat)
+{
+	struct em_perf_table *em_table;
+	struct em_perf_state *ps;
+	int i;
 
-			/*
-			 * If there is no active max uclamp constraint,
-			 * directly use task's one, otherwise keep max.
-			 */
-			if (uclamp_rq_is_idle(cpu_rq(cpu)))
-				max = uclamp_eff_value(p, UCLAMP_MAX);
-			else
-				max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
-		}
+	/*
+	 * Find the lowest performance state of the Energy Model above the
+	 * requested performance.
+	 */
+	em_table = rcu_dereference(pd->em_table);
+	i = em_pd_get_efficient_state(em_table->state, pd->nr_perf_states,
+				      max_util, pd->flags);
+	ps = &em_table->state[i];
 
-		eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
-		max_util = max(max_util, eff_util);
+	/* Save the cost and performance range of the OPP */
+	stat->max_perf = ps->performance;
+	stat->cost = ps->cost;
+	i = em_pd_get_previous_state(em_table->state, pd->nr_perf_states,
+				     i, pd->flags);
+	if (i < 0)
+		stat->min_perf = 0;
+	else {
+		ps = &em_table->state[i];
+		stat->min_perf = ps->performance;
 	}
-
-	return min(max_util, eenv->cpu_cap);
 }
 
-/*
- * compute_energy(): Use the Energy Model to estimate the energy that @pd would
- * consume for a given utilization landscape @eenv. When @dst_cpu < 0, the task
- * contribution is ignored.
- */
-static inline unsigned long
-compute_energy(struct energy_env *eenv, struct perf_domain *pd,
-	       struct cpumask *pd_cpus, struct task_struct *p, int dst_cpu)
+/* Check if the CPU can handle the waking task */
+static int check_cpu_with_task(struct task_struct *p, int cpu)
 {
-	unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
-	unsigned long busy_time = eenv->pd_busy_time;
-	unsigned long energy;
+	unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
+	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
+	unsigned long util_min = p_util_min;
+	unsigned long util_max = p_util_max;
+	unsigned long util = cpu_util(cpu, p, cpu, 0);
+	struct rq *rq = cpu_rq(cpu);
 
-	if (dst_cpu >= 0)
-		busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
+	/*
+	 * Skip CPUs that cannot satisfy the capacity request.
+	 * IOW, placing the task there would make the CPU
+	 * overutilized. Take uclamp into account to see how
+	 * much capacity we can get out of the CPU; this is
+	 * aligned with sched_cpu_util().
+	 */
+	if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
+		unsigned long rq_util_min, rq_util_max;
+		/*
+		 * Open code uclamp_rq_util_with() except for
+		 * the clamp() part. I.e.: apply max aggregation
+		 * only. util_fits_cpu() logic requires to
+		 * operate on non clamped util but must use the
+		 * max-aggregated uclamp_{min, max}.
+		 */
+		rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
+		rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
+		util_min = max(rq_util_min, p_util_min);
+		util_max = max(rq_util_max, p_util_max);
+	}
+	return util_fits_cpu(util, util_min, util_max, cpu);
+}
 
-	energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
+/* For the same cost, select the CPU that will provide the best performance for the task */
+static bool select_best_cpu(struct energy_cpu_stat *target,
+			    struct energy_cpu_stat *min,
+			    int prev, struct sched_domain *sd)
+{
+	/* Select the one with the least number of running tasks */
+	if (target->nr_running < min->nr_running)
+		return true;
+	if (target->nr_running > min->nr_running)
+		return false;
 
-	trace_sched_compute_energy_tp(p, dst_cpu, energy, max_util, busy_time);
+	/* Favor the previous CPU otherwise */
+	if (target->cpu == prev)
+		return true;
+	if (min->cpu == prev)
+		return false;
 
-	return energy;
+	/*
+	 * Choose the CPU with the lowest contention. One might want to consider
+	 * load instead of runnable, but we are supposed to not be overutilized,
+	 * so there is enough compute capacity for everybody.
+	 */
+	if ((target->runnable * min->capa * sd->imbalance_pct) >=
+	    (min->runnable * target->capa * 100))
+		return false;
+
+	return true;
 }
 
 /*
  * find_energy_efficient_cpu(): Find most energy-efficient target CPU for the
- * waking task. find_energy_efficient_cpu() looks for the CPU with maximum
- * spare capacity in each performance domain and uses it as a potential
- * candidate to execute the task. Then, it uses the Energy Model to figure
- * out which of the CPU candidates is the most energy-efficient.
+ * waking task. find_energy_efficient_cpu() looks for the CPU with the lowest
+ * power cost (usually with maximum spare capacity but not always) in each
+ * performance domain and uses it as a potential candidate to execute the task.
+ * Then, it uses the Energy Model to figure out which of the CPU candidates is
+ * the most energy-efficient.
  *
  * The rationale for this heuristic is as follows. In a performance domain,
  * all the most energy efficient CPU candidates (according to the Energy
@@ -8267,17 +8303,14 @@ compute_energy(struct energy_env *eenv, struct perf_domain *pd,
 static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
-	unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
-	unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
-	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
+	unsigned long task_util;
+	unsigned long best_nrg = ULONG_MAX;
+	int best_fits = -1;
+	int best_cpu = -1;
 	struct root_domain *rd = this_rq()->rd;
-	int cpu, best_energy_cpu, target = -1;
-	int prev_fits = -1, best_fits = -1;
-	unsigned long best_actual_cap = 0;
-	unsigned long prev_actual_cap = 0;
+	int cpu, target = -1;
 	struct sched_domain *sd;
 	struct perf_domain *pd;
-	struct energy_env eenv;
 
 	rcu_read_lock();
 	pd = rcu_dereference(rd->pd);
@@ -8296,20 +8329,21 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 
 	target = prev_cpu;
 
-	sync_entity_load_avg(&p->se);
-	if (!task_util_est(p) && p_util_min == 0)
-		goto unlock;
 
-	eenv_task_busy_time(&eenv, p, prev_cpu);
+	sync_entity_load_avg(&p->se);
+	task_util = task_busy_time(p, prev_cpu);
 
 	for (; pd; pd = pd->next) {
-		unsigned long util_min = p_util_min, util_max = p_util_max;
-		unsigned long cpu_cap, cpu_actual_cap, util;
-		long prev_spare_cap = -1, max_spare_cap = -1;
-		unsigned long rq_util_min, rq_util_max;
-		unsigned long cur_delta, base_energy;
-		int max_spare_cap_cpu = -1;
-		int fits, max_fits = -1;
+		unsigned long cpu_actual_cap, max_cost = 0;
+		unsigned long pd_actual_util = 0, delta_nrg = 0;
+		struct energy_cpu_stat target_stat;
+		struct energy_cpu_stat min_stat = {
+			.cost = ULONG_MAX,
+			.max_perf = ULONG_MAX,
+			.min_perf = ULONG_MAX,
+			.fits = -2,
+			.cpu = -1,
+		};
 
 		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
 
@@ -8320,13 +8354,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 		cpu = cpumask_first(cpus);
 		cpu_actual_cap = get_actual_cpu_capacity(cpu);
 
-		eenv.cpu_cap = cpu_actual_cap;
-		eenv.pd_cap = 0;
-
+		/* In a PD, the CPU with the lowest cost will be the most efficient */
 		for_each_cpu(cpu, cpus) {
-			struct rq *rq = cpu_rq(cpu);
-
-			eenv.pd_cap += cpu_actual_cap;
+			unsigned long target_perf;
 
 			if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
 				continue;
@@ -8334,120 +8364,112 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			if (!cpumask_test_cpu(cpu, p->cpus_ptr))
 				continue;
 
-			util = cpu_util(cpu, p, cpu, 0);
-			cpu_cap = capacity_of(cpu);
+			target_stat.fits = check_cpu_with_task(p, cpu);
+
+			if (!target_stat.fits)
+				continue;
+
+			/* 1st select the CPU that fits best */
+			if (target_stat.fits < min_stat.fits)
+				continue;
+
+			/* Then select the CPU with lowest cost */
+
+			/* Get the performance of the CPU w/ waking task. */
+			target_perf = find_cpu_max_util(cpu, p, cpu);
+			target_perf = min(target_perf, cpu_actual_cap);
+
+			/* Needing a higher OPP means a higher cost */
+			if (target_perf > min_stat.max_perf)
+				continue;
 
 			/*
-			 * Skip CPUs that cannot satisfy the capacity request.
-			 * IOW, placing the task there would make the CPU
-			 * overutilized. Take uclamp into account to see how
-			 * much capacity we can get out of the CPU; this is
-			 * aligned with sched_cpu_util().
+			 * At this point, target's cost can be either equal to or
+			 * lower than the current minimum cost.
 			 */
-			if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
-				/*
-				 * Open code uclamp_rq_util_with() except for
-				 * the clamp() part. I.e.: apply max aggregation
-				 * only. util_fits_cpu() logic requires to
-				 * operate on non clamped util but must use the
-				 * max-aggregated uclamp_{min, max}.
-				 */
-				rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
-				rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
 
-				util_min = max(rq_util_min, p_util_min);
-				util_max = max(rq_util_max, p_util_max);
-			}
+			/* Gather more statistics */
+			target_stat.cpu = cpu;
+			target_stat.runnable = cpu_runnable(cpu_rq(cpu));
+			target_stat.capa = capacity_of(cpu);
+			target_stat.nr_running = cpu_rq(cpu)->cfs.h_nr_running;
 
-			fits = util_fits_cpu(util, util_min, util_max, cpu);
-			if (!fits)
+			/* If the target needs a lower OPP, then look up
+			 * the corresponding OPP and its associated cost.
+			 * Otherwise, at the same cost level, select the CPU
+			 * which provides the best performance.
+			 */
+			if (target_perf < min_stat.min_perf)
+				find_pd_cost(pd->em_pd, target_perf, &target_stat);
+			else if (!select_best_cpu(&target_stat, &min_stat, prev_cpu, sd))
 				continue;
 
-			lsub_positive(&cpu_cap, util);
-
-			if (cpu == prev_cpu) {
-				/* Always use prev_cpu as a candidate. */
-				prev_spare_cap = cpu_cap;
-				prev_fits = fits;
-			} else if ((fits > max_fits) ||
-				   ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
-				/*
-				 * Find the CPU with the maximum spare capacity
-				 * among the remaining CPUs in the performance
-				 * domain.
-				 */
-				max_spare_cap = cpu_cap;
-				max_spare_cap_cpu = cpu;
-				max_fits = fits;
-			}
+			/* Save the new most efficient CPU of the PD */
+			min_stat = target_stat;
 		}
 
-		if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
+		if (min_stat.cpu == -1)
 			continue;
 
-		eenv_pd_busy_time(&eenv, cpus, p);
-		/* Compute the 'base' energy of the pd, without @p */
-		base_energy = compute_energy(&eenv, pd, cpus, p, -1);
+		if (min_stat.fits < best_fits)
+			continue;
 
-		/* Evaluate the energy impact of using prev_cpu. */
-		if (prev_spare_cap > -1) {
-			prev_delta = compute_energy(&eenv, pd, cpus, p,
-						    prev_cpu);
-			/* CPU utilization has changed */
-			if (prev_delta < base_energy)
-				goto unlock;
-			prev_delta -= base_energy;
-			prev_actual_cap = cpu_actual_cap;
-			best_delta = min(best_delta, prev_delta);
-		}
+		/* Idle system costs nothing */
+		target_stat.max_perf = 0;
+		target_stat.cost = 0;
 
-		/* Evaluate the energy impact of using max_spare_cap_cpu. */
-		if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
-			/* Current best energy cpu fits better */
-			if (max_fits < best_fits)
-				continue;
+		/* Estimate utilization and cost without p */
+		for_each_cpu(cpu, cpus) {
+			unsigned long target_util;
 
-			/*
-			 * Both don't fit performance hint (i.e. uclamp_min)
-			 * but best energy cpu has better capacity.
-			 */
-			if ((max_fits < 0) &&
-			    (cpu_actual_cap <= best_actual_cap))
-				continue;
+			/* Accumulate actual utilization w/o task p */
+			pd_actual_util += find_cpu_actual_util(cpu, p);
 
-			cur_delta = compute_energy(&eenv, pd, cpus, p,
-						   max_spare_cap_cpu);
-			/* CPU utilization has changed */
-			if (cur_delta < base_energy)
-				goto unlock;
-			cur_delta -= base_energy;
+			/* Get the max utilization of the CPU w/o task p */
+			target_util = find_cpu_max_util(cpu, p, -1);
+			target_util = min(target_util, cpu_actual_cap);
 
-			/*
-			 * Both fit for the task but best energy cpu has lower
-			 * energy impact.
-			 */
-			if ((max_fits > 0) && (best_fits > 0) &&
-			    (cur_delta >= best_delta))
+			/* Current OPP is enough */
+			if (target_util <= target_stat.max_perf)
 				continue;
 
-			best_delta = cur_delta;
-			best_energy_cpu = max_spare_cap_cpu;
-			best_fits = max_fits;
-			best_actual_cap = cpu_actual_cap;
+			/* Compute and save the cost of the OPP */
+			find_pd_cost(pd->em_pd, target_util, &target_stat);
+			max_cost = target_stat.cost;
 		}
-	}
-	rcu_read_unlock();
 
-	if ((best_fits > prev_fits) ||
-	    ((best_fits > 0) && (best_delta < prev_delta)) ||
-	    ((best_fits < 0) && (best_actual_cap > prev_actual_cap)))
-		target = best_energy_cpu;
+		/* Add the NRG cost of p */
+		delta_nrg = task_util * min_stat.cost;
 
-	return target;
+		/* Compute the NRG cost of others running at higher OPP because of p */
+		if (min_stat.cost > max_cost)
+			delta_nrg += pd_actual_util * (min_stat.cost - max_cost);
+
+		/* nrg with p */
+		trace_sched_compute_energy_tp(p, min_stat.cpu, delta_nrg,
+					      min_stat.max_perf, pd_actual_util + task_util);
+
+		/*
+		 * The probability that delta NRGs are equal is almost null. PDs being
+		 * sorted by max capacity, keep the one with the highest max capacity
+		 * if this happens.
+		 * TODO: add a margin in nrg cost and take into account other stats
+		 */
+		if ((min_stat.fits == best_fits) &&
+		    (delta_nrg >= best_nrg))
+			continue;
+
+		best_fits = min_stat.fits;
+		best_nrg = delta_nrg;
+		best_cpu = min_stat.cpu;
+	}
 
 unlock:
 	rcu_read_unlock();
 
+	if (best_cpu >= 0)
+		target = best_cpu;
+
 	return target;
 }
 
-- 
2.34.1
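The comparison feec() now makes between performance domains boils down to
a small amount of arithmetic. The sketch below (plain C with made-up
numbers, not kernel code) reproduces the delta_nrg computation from the
hunk above: the task pays the cost of the OPP it will run at, and every
other running utilization in the PD pays the cost difference whenever the
task forces a higher OPP:

#include <stdio.h>

static unsigned long delta_nrg(unsigned long task_util,
			       unsigned long pd_actual_util,
			       unsigned long cost_with_task,
			       unsigned long cost_without_task)
{
	/* The task's own utilization runs at the OPP selected with it */
	unsigned long nrg = task_util * cost_with_task;

	/* Others now run at a more expensive OPP because of the task */
	if (cost_with_task > cost_without_task)
		nrg += pd_actual_util * (cost_with_task - cost_without_task);

	return nrg;
}

int main(void)
{
	/* Little PD: cheap OPP, the task does not raise it */
	printf("little: %lu\n", delta_nrg(100, 300, 25, 25));	/* 2500 */
	/* Big PD: the task bumps the OPP, so the others pay the difference */
	printf("big   : %lu\n", delta_nrg(100, 300, 60, 40));	/* 12000 */
	return 0;
}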
From: Vincent Guittot
To: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
    dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
    mgorman@suse.de, vschneid@redhat.com, lukasz.luba@arm.com,
    rafael.j.wysocki@intel.com, linux-kernel@vger.kernel.org
Cc: qyousef@layalina.io, hongyan.xia2@arm.com, Vincent Guittot
Subject: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
Date: Fri, 30 Aug 2024 15:03:08 +0200
Message-Id: <20240830130309.2141697-5-vincent.guittot@linaro.org>
In-Reply-To: <20240830130309.2141697-1-vincent.guittot@linaro.org>
References: <20240830130309.2141697-1-vincent.guittot@linaro.org>

Keep looking for an energy-efficient CPU even when the system is
overutilized, and use the CPU returned by feec() if it has been able to
find one. Otherwise, fall back to the default performance-and-spread mode
of the scheduler.

A system can become overutilized for a short time when workers of a
workqueue wake up for a short background job, like a vmstat update.
Continuing to look for an energy-efficient CPU prevents breaking the
power packing of tasks.
Signed-off-by: Vincent Guittot
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2273eecf6086..e46af2416159 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	    cpumask_test_cpu(cpu, p->cpus_ptr))
 		return cpu;
 
-	if (!is_rd_overutilized(this_rq()->rd)) {
+	if (sched_energy_enabled()) {
 		new_cpu = find_energy_efficient_cpu(p, prev_cpu);
 		if (new_cpu >= 0)
 			return new_cpu;
-- 
2.34.1
From: Vincent Guittot
To: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
    dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com,
    mgorman@suse.de, vschneid@redhat.com, lukasz.luba@arm.com,
    rafael.j.wysocki@intel.com, linux-kernel@vger.kernel.org
Cc: qyousef@layalina.io, hongyan.xia2@arm.com, Vincent Guittot
Subject: [RFC PATCH 5/5] sched/fair: Add push task callback for EAS
Date: Fri, 30 Aug 2024 15:03:09 +0200
Message-Id: <20240830130309.2141697-6-vincent.guittot@linaro.org>
In-Reply-To: <20240830130309.2141697-1-vincent.guittot@linaro.org>
References: <20240830130309.2141697-1-vincent.guittot@linaro.org>

EAS relies on wakeup events to place tasks efficiently on the system, but
there are cases where a task no longer gets wakeup events, or gets them at
far too low a pace. For such situations, we can take advantage of the task
being put back in the enqueued list to check whether it should be migrated
to another CPU. When the task is the only one running on the CPU, the tick
will check whether the task is stuck on this CPU and should be migrated to
another one.
Wakeup events remain the main way to migrate tasks, but we now also detect
situations where a task is stuck on a CPU by checking that its utilization
is larger than the max available compute capacity (max CPU capacity or
uclamp max setting).

Signed-off-by: Vincent Guittot
---
 kernel/sched/fair.c  | 211 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |   2 +
 2 files changed, 213 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e46af2416159..41fb18ac118b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5455,6 +5455,7 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+static void dequeue_pushable_task(struct cfs_rq *cfs_rq, struct sched_entity *se, bool queue);
 
 static bool
 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
@@ -5463,6 +5464,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	update_curr(cfs_rq);
 
+	dequeue_pushable_task(cfs_rq, se, false);
+
 	if (flags & DEQUEUE_DELAYED) {
 		SCHED_WARN_ON(!se->sched_delayed);
 	} else {
@@ -5585,6 +5588,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	}
 
 	se->prev_sum_exec_runtime = se->sum_exec_runtime;
+
+	dequeue_pushable_task(cfs_rq, se, true);
 }
 
 static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags);
@@ -5620,6 +5625,7 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
 }
 
 static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+static void enqueue_pushable_task(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
@@ -5639,9 +5645,16 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
 		update_load_avg(cfs_rq, prev, 0);
+
+		/*
+		 * The previous task might be eligible for being pushed on
+		 * another CPU if it is still active.
+		 */
+		enqueue_pushable_task(cfs_rq, prev);
 	}
 	SCHED_WARN_ON(cfs_rq->curr != prev);
 	cfs_rq->curr = NULL;
+
 }
 
 static void
@@ -8393,6 +8406,8 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			target_stat.runnable = cpu_runnable(cpu_rq(cpu));
 			target_stat.capa = capacity_of(cpu);
 			target_stat.nr_running = cpu_rq(cpu)->cfs.h_nr_running;
+			if ((p->on_rq) && (cpu == prev_cpu))
+				target_stat.nr_running--;
 
 			/* If the target needs a lower OPP, then look up
 			 * the corresponding OPP and its associated cost.
@@ -8473,6 +8488,197 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	return target;
 }
 
+static inline bool task_misfit_cpu(struct task_struct *p, int cpu)
+{
+	unsigned long max_capa = get_actual_cpu_capacity(cpu);
+	unsigned long util = task_util_est(p);
+
+	max_capa = min(max_capa, uclamp_eff_value(p, UCLAMP_MAX));
+	util = max(util, task_runnable(p));
+
+	/*
+	 * Return true only if the task might not sleep/wakeup because of a low
+	 * compute capacity. Tasks that wake up regularly will be handled by
+	 * feec().
+	 */
+	return (util > max_capa);
+}
+
+static int active_load_balance_cpu_stop(void *data);
+
+static inline void check_misfit_cpu(struct task_struct *p, struct rq *rq)
+{
+	int new_cpu, cpu = cpu_of(rq);
+
+	if (!sched_energy_enabled())
+		return;
+
+	if (WARN_ON(!p))
+		return;
+
+	if (WARN_ON(p != rq->curr))
+		return;
+
+	if (is_migration_disabled(p))
+		return;
+
+	if ((rq->nr_running > 1) || (p->nr_cpus_allowed == 1))
+		return;
+
+	if (!task_misfit_cpu(p, cpu))
+		return;
+
+	new_cpu = find_energy_efficient_cpu(p, cpu);
+
+	if (new_cpu == cpu)
+		return;
+
+	/*
+	 * ->active_balance synchronizes accesses to
+	 * ->active_balance_work. Once set, it's cleared
+	 * only after active load balance is finished.
+	 */
+	if (!rq->active_balance) {
+		rq->active_balance = 1;
+		rq->push_cpu = new_cpu;
+	} else
+		return;
+
+	raw_spin_rq_unlock(rq);
+	stop_one_cpu_nowait(cpu,
+			    active_load_balance_cpu_stop, rq,
+			    &rq->active_balance_work);
+	raw_spin_rq_lock(rq);
+}
+
+static inline int has_pushable_tasks(struct rq *rq)
+{
+	return !plist_head_empty(&rq->cfs.pushable_tasks);
+}
+
+static struct task_struct *pick_next_pushable_fair_task(struct rq *rq)
+{
+	struct task_struct *p;
+
+	if (!has_pushable_tasks(rq))
+		return NULL;
+
+	p = plist_first_entry(&rq->cfs.pushable_tasks,
+			      struct task_struct, pushable_tasks);
+
+	WARN_ON_ONCE(rq->cpu != task_cpu(p));
+	WARN_ON_ONCE(task_current(rq, p));
+	WARN_ON_ONCE(p->nr_cpus_allowed <= 1);
+
+	WARN_ON_ONCE(!task_on_rq_queued(p));
+
+	/*
+	 * Remove the task from the pushable list as we try only once after
+	 * the task has been put back in the enqueued list.
+	 */
+	plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+
+	return p;
+}
+
+/*
+ * See if the non-running fair tasks on this rq
+ * can be sent to some other CPU that fits better with
+ * their profile.
+ */
+static int push_fair_task(struct rq *rq)
+{
+	struct task_struct *next_task;
+	struct rq *new_rq;
+	int prev_cpu, new_cpu;
+	int ret = 0;
+
+	next_task = pick_next_pushable_fair_task(rq);
+	if (!next_task)
+		return 0;
+
+	if (is_migration_disabled(next_task))
+		return 0;
+
+	if (WARN_ON(next_task == rq->curr))
+		return 0;
+
+	/* We might release the rq lock */
+	get_task_struct(next_task);
+
+	prev_cpu = rq->cpu;
+
+	new_cpu = find_energy_efficient_cpu(next_task, prev_cpu);
+
+	if (new_cpu == prev_cpu)
+		goto out;
+
+	new_rq = cpu_rq(new_cpu);
+
+	if (double_lock_balance(rq, new_rq)) {
+
+		deactivate_task(rq, next_task, 0);
+		set_task_cpu(next_task, new_cpu);
+		activate_task(new_rq, next_task, 0);
+		ret = 1;
+
+		resched_curr(new_rq);
+
+		double_unlock_balance(rq, new_rq);
+	}
+
+out:
+	put_task_struct(next_task);
+
+	return ret;
+}
+
+static void push_fair_tasks(struct rq *rq)
+{
+	/* push_fair_task() returns 1 if it moved a fair task */
+	while (push_fair_task(rq))
+		;
+}
+
+static DEFINE_PER_CPU(struct balance_callback, fair_push_head);
+
+static inline void fair_queue_push_tasks(struct rq *rq)
+{
+	if (!sched_energy_enabled() || !has_pushable_tasks(rq))
+		return;
+
+	queue_balance_callback(rq, &per_cpu(fair_push_head, rq->cpu), push_fair_tasks);
+}
+
+static void dequeue_pushable_task(struct cfs_rq *cfs_rq, struct sched_entity *se, bool queue)
+{
+	struct task_struct *p;
+	struct rq *rq;
+
+	if (sched_energy_enabled() && entity_is_task(se)) {
+		rq = rq_of(cfs_rq);
+		p = container_of(se, struct task_struct, se);
+
+		plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+
+		if (queue)
+			fair_queue_push_tasks(rq);
+	}
+}
+
+static void enqueue_pushable_task(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+	if (sched_energy_enabled() && entity_is_task(se)) {
+		struct task_struct *p = container_of(se, struct task_struct, se);
+		struct rq *rq = rq_of(cfs_rq);
+
+		if ((p->nr_cpus_allowed > 1) && task_misfit_cpu(p, rq->cpu)) {
+			plist_del(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+			plist_node_init(&p->pushable_tasks, p->prio);
+			plist_add(&p->pushable_tasks, &rq->cfs.pushable_tasks);
+		}
+	}
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the relevant SD flag set. In practice, this is SD_BALANCE_WAKE,
@@ -8642,6 +8848,8 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	return sched_balance_newidle(rq, rf) != 0;
 }
 #else
+static inline void dequeue_pushable_task(struct cfs_rq *cfs_rq, struct sched_entity *se, bool queue) {}
+static inline void enqueue_pushable_task(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
 static inline void set_task_max_allowed_capacity(struct task_struct *p) {}
 #endif /* CONFIG_SMP */
 
@@ -13013,6 +13221,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		check_update_overutilized_status(task_rq(curr));
 
 	task_tick_core(rq, curr);
+
+	check_misfit_cpu(curr, rq);
 }
 
 /*
@@ -13204,6 +13414,7 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 void init_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->tasks_timeline = RB_ROOT_CACHED;
+	plist_head_init(&cfs_rq->pushable_tasks);
 	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
 #ifdef CONFIG_SMP
 	raw_spin_lock_init(&cfs_rq->removed.lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2f5d658c0631..f3327695d4a3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -672,6 +672,8 @@ struct cfs_rq {
 	struct list_head	leaf_cfs_rq_list;
 	struct task_group	*tg;	/* group that "owns" this runqueue */
 
+	struct plist_head	pushable_tasks;
+
 	/* Locally cached copy of our task_group's idle value */
 	int			idle;
 
-- 
2.34.1
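The trigger for the whole push path is the misfit test. Below is a
standalone sketch of the check done by task_misfit_cpu(), with utilization
and capacity passed in as plain numbers instead of being read from the
task and CPU; the helper names and the example values are illustrative
only, not the kernel's:

#include <stdbool.h>
#include <stdio.h>

static unsigned long min_ul(unsigned long a, unsigned long b) { return a < b ? a : b; }
static unsigned long max_ul(unsigned long a, unsigned long b) { return a > b ? a : b; }

static bool task_misfit_cpu(unsigned long task_util, unsigned long task_runnable,
			    unsigned long cpu_actual_capa, unsigned long uclamp_max)
{
	/* The capacity the task can actually get: CPU capacity capped by uclamp_max */
	unsigned long max_capa = min_ul(cpu_actual_capa, uclamp_max);
	/* Be pessimistic: account for time spent runnable but not running */
	unsigned long util = max_ul(task_util, task_runnable);

	/* Stuck: the task may never sleep/wake, so feec() never gets a chance */
	return util > max_capa;
}

int main(void)
{
	/* util ~300 on a little CPU with capacity 256: stuck, push it elsewhere */
	printf("little: misfit=%d\n", task_misfit_cpu(300, 310, 256, 1024));	/* 1 */
	/* the same task on a big CPU: fine, the wakeup path will handle it */
	printf("big   : misfit=%d\n", task_misfit_cpu(300, 310, 1024, 1024));	/* 0 */
	return 0;
}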