From: "liukai (Y)"
To: "mingo@redhat.com", "peterz@infradead.org", "juri.lelli@redhat.com", "vincent.guittot@linaro.org"
CC: "linux-kernel@vger.kernel.org", "tanghui (C)", "Zhangqiao (2012 lab)", "Chenhui (Judy)", "'weiliang.qwl@antgroup.com'", "'henry.hj@antgroup.com'", "'yanyan.yan@antgroup.com'", "'libang.li@antgroup.com'", "liwei (JK)"
Subject: Trade-off between load_balance frequency and CPU utilization under high load
Date: Thu, 26 Dec 2024 02:09:33 +0000
In-Reply-To: <0aab22639ee0476a9a942bc4b06ebbce@huawei.com>

In our performance experiments, we gradually increased the CPU load and
observed that under high load, the CPU utilization (node CPU) on kernel
6.6 is up to 4% higher than on 4.19. Flame graph data shows that the
total execution time of load_balance on kernel 6.6 is 18% longer than
on 4.19.
Benchmark: specjbb // kernel 6.6

index  target QPS  actual QPS    RT  pod CPU  node CPU
1           60000       60004  0.73     5.06     14.97
2          120000      120074  0.79    10.22     29.67
3          180000      180866  0.87    16.00     45.91
4          240000      240091  0.92    21.69     62.94

Benchmark: specjbb // kernel 4.19

index  target QPS  actual QPS    RT  pod CPU  node CPU
1           60000       60004  0.72     4.86     14.81
2          120000      120074  0.79     9.69     29.52
3          180000      180870  0.83    14.57     42.72
4          240000      240074  0.90    19.55     58.59

We found that on kernel 6.6 a single execution of load_balance is less
costly, so sd->max_newidle_lb_cost stays small. Even under high load,
the skip condition this_rq->avg_idle < sd->max_newidle_lb_cost is
therefore rarely satisfied and the early exit below is not taken. As a
result, compared to kernel 4.19, load_balance is executed more
frequently on 6.6, leading to higher CPU utilization.

	if (!READ_ONCE(this_rq->rd->overload) ||
	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

		if (sd)
			update_next_balance(sd, &next_balance);
		rcu_read_unlock();

		goto out;
	}

We identified that this behavior was introduced by this patch:
https://lore.kernel.org/all/20211021095219.GG3891@suse.de/

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10895,8 +10895,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)

 	rcu_read_lock();
 	sd = rcu_dereference_check_sched_domain(this_rq->sd);

-	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-	    !READ_ONCE(this_rq->rd->overload) ||
+	if (!READ_ONCE(this_rq->rd->overload) ||
 	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 		if (sd)

The removed check this_rq->avg_idle < sysctl_sched_migration_cost used
to reduce the frequency of load_balance under high load. Is there any
way to dynamically adjust the execution of load_balance in high-load
scenarios, in order to strike a balance between maintaining good CPU
utilization and avoiding unnecessary load_balance executions?