From: Chen Yu
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
    K Prateek Nayak, Gautham R. Shenoy, Chen Yu, Aaron Lu,
    linux-kernel@vger.kernel.org
Subject: [RFC PATCH 1/7] sched/topology: Assign sd_share for all non-NUMA sched domains
Date: Thu, 27 Jul 2023 22:34:22 +0800
Message-Id: <169500eaa13198382765027eb047e6c7a0e5a13e.1690273854.git.yu.c.chen@intel.com>

Currently, only a domain with the SD_SHARE_PKG_RESOURCES flag set shares
one sd_share instance among all CPUs in that domain. Remove this
restriction and extend sd_share to every sched domain below the NUMA
level. The shared field will be used by a later patch that optimizes
newidle balancing.
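As a rough illustration (a hypothetical debug helper, not part of this
series, assuming kernel/sched context): after this change a walk up a
CPU's domain hierarchy finds a valid sd->shared at every level below
NUMA, so a sketch like the following would print one line per non-NUMA
domain:

static void dump_sd_shared(int cpu)
{
    struct sched_domain *sd;

    rcu_read_lock();
    for_each_domain(cpu, sd) {
        if (sd->flags & SD_NUMA)
            break;
        /* sd->shared is now populated for every non-NUMA level */
        pr_info("cpu%d level%d: ref=%d nr_busy_cpus=%d\n",
            cpu, sd->level, atomic_read(&sd->shared->ref),
            atomic_read(&sd->shared->nr_busy_cpus));
    }
    rcu_read_unlock();
}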
Suggested-by: Gautham R. Shenoy
Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Chen Yu
---
 kernel/sched/topology.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d3a3b2646ec4..64212f514765 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1641,10 +1641,10 @@ sd_init(struct sched_domain_topology_level *tl,
 	}

 	/*
-	 * For all levels sharing cache; connect a sched_domain_shared
+	 * For all levels except for NUMA; connect a sched_domain_shared
 	 * instance.
 	 */
-	if (sd->flags & SD_SHARE_PKG_RESOURCES) {
+	if (!(sd->flags & SD_NUMA)) {
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
-- 
2.25.1


From: Chen Yu
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
    K Prateek Nayak, Gautham R. Shenoy, Chen Yu, Aaron Lu,
    linux-kernel@vger.kernel.org
Subject: [RFC PATCH 2/7] sched/topology: Introduce nr_groups in sched_domain to indicate the number of groups
Date: Thu, 27 Jul 2023 22:34:36 +0800

Record the number of sched groups within each sched domain, in
preparation for the newidle_balance() scan-depth calculation introduced
later by ILB_UTIL.
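For intuition (illustrative sketch only, not part of the patch): the
sched groups of a domain hang off a circular singly linked list, so
without the new field the group count would have to be recomputed by a
walk like the one below each time it is needed; sd->nr_groups simply
caches the result of this walk at domain build time.

static unsigned int count_sched_groups(struct sched_domain *sd)
{
    struct sched_group *sg = sd->groups;
    unsigned int n = 0;

    /* sched groups form a circular list; walk until we wrap around */
    do {
        n++;
        sg = sg->next;
    } while (sg != sd->groups);

    return n;
}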
Signed-off-by: Chen Yu
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/topology.c        | 10 ++++++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 67b573d5bf28..c07f2f00317a 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -152,6 +152,7 @@ struct sched_domain {
 	struct sched_domain_shared *shared;

 	unsigned int span_weight;
+	unsigned int nr_groups;
 	/*
 	 * Span of all CPUs in this domain.
 	 *
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 64212f514765..56dc564fc9a3 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1023,7 +1023,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 	struct cpumask *covered = sched_domains_tmpmask;
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
-	int i;
+	int i, nr_groups = 0;

 	cpumask_clear(covered);

@@ -1087,6 +1087,8 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		if (!sg)
 			goto fail;

+		nr_groups++;
+
 		sg_span = sched_group_span(sg);
 		cpumask_or(covered, covered, sg_span);

@@ -1100,6 +1102,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		last->next = first;
 	}
 	sd->groups = first;
+	sd->nr_groups = nr_groups;

 	return 0;

@@ -1233,7 +1236,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 	struct sd_data *sdd = sd->private;
 	const struct cpumask *span = sched_domain_span(sd);
 	struct cpumask *covered;
-	int i;
+	int i, nr_groups = 0;

 	lockdep_assert_held(&sched_domains_mutex);
 	covered = sched_domains_tmpmask;
@@ -1248,6 +1251,8 @@ build_sched_groups(struct sched_domain *sd, int cpu)

 		sg = get_group(i, sdd);

+		nr_groups++;
+
 		cpumask_or(covered, covered, sched_group_span(sg));

 		if (!first)
@@ -1258,6 +1263,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 	}
 	last->next = first;
 	sd->groups = first;
+	sd->nr_groups = nr_groups;

 	return 0;
 }
-- 
2.25.1


From: Chen Yu
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
    K Prateek Nayak, Gautham R. Shenoy, Chen Yu, Aaron Lu,
    linux-kernel@vger.kernel.org
Subject: [RFC PATCH 3/7] sched/fair: Save a snapshot of sched domain total_load and total_capacity
Date: Thu, 27 Jul 2023 22:34:50 +0800
Message-Id: <0d71de8648889fe8b202be376e97d581ff3f12ed.1690273854.git.yu.c.chen@intel.com>

Save the total_load and total_capacity of the current sched domain
during each periodic load balance. These statistics can later be reused
by a CPU_NEWLY_IDLE load balance that quits the group scan early.
Introduce a sched feature, ILB_SNAPSHOT, to control this.

Readers can check whether sd_share->total_capacity is non-zero to
verify that the snapshot is valid. In theory, once the system has
reached a stable state, total_capacity and total_load should not change
dramatically.
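For illustration (a sketch only; the real consumer of the snapshot is
added by a later patch of this series), a reader could look like:

static bool ilb_load_snapshot(struct sched_domain_shared *sd_share,
                  unsigned long *load, unsigned long *capacity)
{
    /* a zero total_capacity means no periodic balance published yet */
    if (!sd_share || !READ_ONCE(sd_share->total_capacity))
        return false;

    *load = READ_ONCE(sd_share->total_load);
    *capacity = READ_ONCE(sd_share->total_capacity);
    return true;
}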
Signed-off-by: Chen Yu
---
 include/linux/sched/topology.h |  2 ++
 kernel/sched/fair.c            | 25 +++++++++++++++++++++++++
 kernel/sched/features.h        |  2 ++
 3 files changed, 29 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index c07f2f00317a..d6a64a2c92aa 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -82,6 +82,8 @@ struct sched_domain_shared {
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
 	int		nr_idle_scan;
+	unsigned long	total_load;
+	unsigned long	total_capacity;
 };

 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3e25be58e2b..edcfee9965cd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10132,6 +10132,27 @@ static void update_idle_cpu_scan(struct lb_env *env,
 	WRITE_ONCE(sd_share->nr_idle_scan, (int)y);
 }

+static void ilb_save_stats(struct lb_env *env,
+			   struct sched_domain_shared *sd_share,
+			   struct sd_lb_stats *sds)
+{
+	if (!sched_feat(ILB_SNAPSHOT))
+		return;
+
+	if (!sd_share)
+		return;
+
+	/* newidle balance is too frequent */
+	if (env->idle == CPU_NEWLY_IDLE)
+		return;
+
+	if (sds->total_load != sd_share->total_load)
+		WRITE_ONCE(sd_share->total_load, sds->total_load);
+
+	if (sds->total_capacity != sd_share->total_capacity)
+		WRITE_ONCE(sd_share->total_capacity, sds->total_capacity);
+}
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -10140,6 +10161,7 @@ static void update_idle_cpu_scan(struct lb_env *env,

 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
+	struct sched_domain_shared *sd_share = env->sd->shared;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
@@ -10209,6 +10231,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}

 	update_idle_cpu_scan(env, sum_util);
+
+	/* save a snapshot of stats during periodic load balance */
+	ilb_save_stats(env, sd_share, sds);
 }

 /**
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..3cb71c8cddc0 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -101,3 +101,5 @@ SCHED_FEAT(LATENCY_WARN, false)

 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+
+SCHED_FEAT(ILB_SNAPSHOT, true)
-- 
2.25.1


From: Chen Yu
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
    K Prateek Nayak, Gautham R. Shenoy, Chen Yu, Aaron Lu,
    linux-kernel@vger.kernel.org
Subject: [RFC PATCH 4/7] sched/fair: Calculate the scan depth for idle balance based on system utilization
Date: Thu, 27 Jul 2023 22:35:02 +0800
Message-Id: <61e6fce60ca738215b6e5ad9033fb692c3a8fbb1.1690273854.git.yu.c.chen@intel.com>

When a CPU is about to enter idle, it invokes newidle_balance() to pull
some tasks from other runqueues. Although the per-domain
max_newidle_lb_cost throttles newidle_balance(), it is worth further
limiting the scan based on overall system utilization. The reason is
that nothing prevents newidle_balance() from being launched
simultaneously on multiple CPUs, and each invocation has to traverse
all the groups to calculate their statistics one by one; with n groups,
up to n CPUs may each scan n groups at once, so the total time spent in
newidle_balance() can reach O(n^2). The issue is more severe when there
are many groups within one domain, for example on a system with a large
number of cores in an LLC domain, and it is bad for both performance
and power saving.

sqlite spends quite some time in newidle balance on Intel Sapphire
Rapids, which has 2 x 56C/112T = 224 CPUs:

   6.69%  0.09%  sqlite3  [kernel.kallsyms]  [k] newidle_balance
   5.39%  4.71%  sqlite3  [kernel.kallsyms]  [k] update_sd_lb_stats

Based on this observation, limit the scan depth of newidle_balance()
by considering the utilization of the sched domain. Let the number of
scanned groups be a linear function of the utilization ratio:

   nr_groups_to_scan = nr_groups * (1 - util_ratio)
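To make the formula concrete, here is a standalone userspace sketch
with hypothetical numbers; the arithmetic mirrors the do_div() sequence
in the patch below:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024ULL

static unsigned int groups_to_scan(unsigned int nr_groups,
                   unsigned int span_weight,
                   unsigned long long sum_util)
{
    /* nr_scan = nr_groups * sum_util / (span_weight * SCHED_CAPACITY_SCALE) */
    unsigned long long nr_scan = nr_groups * sum_util;

    nr_scan /= span_weight * SCHED_CAPACITY_SCALE;
    return nr_groups - (unsigned int)nr_scan;
}

int main(void)
{
    /* e.g. 56 groups, 112 CPUs, each CPU about 75% utilized */
    printf("%u\n", groups_to_scan(56, 112, 112ULL * 768));  /* prints 14 */
    return 0;
}

That is, at roughly 75% utilization only 14 of the 56 groups would be
scanned, while at 0% utilization all 56 would be.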
Suggested-by: Tim Chen
Signed-off-by: Chen Yu
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 30 ++++++++++++++++++++++++++++++
 kernel/sched/features.h        |  1 +
 3 files changed, 32 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index d6a64a2c92aa..af2261308529 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -84,6 +84,7 @@ struct sched_domain_shared {
 	int		nr_idle_scan;
 	unsigned long	total_load;
 	unsigned long	total_capacity;
+	int		nr_sg_scan;
 };

 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index edcfee9965cd..6925813db59b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10153,6 +10153,35 @@ static void ilb_save_stats(struct lb_env *env,
 		WRITE_ONCE(sd_share->total_capacity, sds->total_capacity);
 }

+static void update_ilb_group_scan(struct lb_env *env,
+				  unsigned long sum_util,
+				  struct sched_domain_shared *sd_share)
+{
+	u64 tmp, nr_scan;
+
+	if (!sched_feat(ILB_UTIL))
+		return;
+
+	if (!sd_share)
+		return;
+
+	if (env->idle == CPU_NEWLY_IDLE)
+		return;
+
+	/*
+	 * Limit the newidle balance scan depth based on overall system
+	 * utilization:
+	 * nr_groups_scan = nr_groups * (1 - util_ratio)
+	 * and util_ratio = sum_util / (sd_weight * SCHED_CAPACITY_SCALE)
+	 */
+	nr_scan = env->sd->nr_groups * sum_util;
+	tmp = env->sd->span_weight * SCHED_CAPACITY_SCALE;
+	do_div(nr_scan, tmp);
+	nr_scan = env->sd->nr_groups - nr_scan;
+	if ((int)nr_scan != sd_share->nr_sg_scan)
+		WRITE_ONCE(sd_share->nr_sg_scan, (int)nr_scan);
+}
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -10231,6 +10260,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}

 	update_idle_cpu_scan(env, sum_util);
+	update_ilb_group_scan(env, sum_util, sd_share);

 	/* save a snapshot of stats during periodic load balance */
 	ilb_save_stats(env, sd_share, sds);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3cb71c8cddc0..30f6d1a2f235 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -103,3 +103,4 @@ SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)

 SCHED_FEAT(ILB_SNAPSHOT, true)
+SCHED_FEAT(ILB_UTIL, true)
-- 
2.25.1


From: Chen Yu
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
    K Prateek Nayak, Gautham R. Shenoy, Chen Yu, Aaron Lu,
    linux-kernel@vger.kernel.org
Subject: [RFC PATCH 5/7] sched/fair: Adjust the busiest group scanning depth in idle load balance
Date: Thu, 27 Jul 2023 22:35:13 +0800
Message-Id: <98e26a26832669b4293a50a701f9b3b8d44e4863.1690273854.git.yu.c.chen@intel.com>

Scanning the whole sched domain to find the busiest group is costly
during newidle_balance(), yet when a CPU becomes idle it should pull
some tasks from other CPUs as quickly as possible.
Limit newidle_balance() to scanning only a bounded number of sched
groups, looking for a relatively busy group to pull from. In short, the
more spare capacity there is in the domain, the more time each newidle
balance may spend on scanning for a busy group. Although newidle
balance already has the per-domain max_newidle_lb_cost to decide
whether to launch the balance at all, ILB_UTIL provides a finer
granularity: it decides how many groups each newidle balance may scan.
The scan depth is calculated by the previous periodic load balance,
based on its overall utilization.

Tested on top of v6.5-rc2, on Sapphire Rapids with 2 x 56C/112T = 224
CPUs, with the cpufreq governor set to performance and C6 disabled.

First, an extreme synthetic test[1], which launches 224 processes, each
a loop of nanosleep(1 us), intended to trigger newidle balance as much
as possible:

   i=1;while [ $i -le "224" ]; do ./nano_sleep 1000 & i=$(($i+1)); done;

NO_ILB_UTIL + ILB_SNAPSHOT:
   9.38%  0.45%  [kernel.kallsyms]  [k] newidle_balance
   6.84%  5.32%  [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0

ILB_UTIL + ILB_SNAPSHOT:
   3.35%  0.38%  [kernel.kallsyms]  [k] newidle_balance
   2.30%  1.81%  [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0

With ILB_UTIL enabled, the total number of newidle_balance() and
update_sd_lb_stats() invocations drops, although why there are fewer
newidle balances has not been investigated yet. Given the low util_avg
values in /sys/kernel/debug/sched/debug, there should not be much
impact on the nanosleep stress test itself.

Tests in a wider range:

[netperf]
Launches nr instances of:
   netperf -4 -H 127.0.0.1 -t $work_mode -c -C -l 100 &
nr: 56, 112, 168, 224, 280, 336, 392, 448
work_mode: TCP_RR UDP_RR

throughput
==========
case            load            baseline(std%)  compare%( std%)
TCP_RR          56-threads       1.00 (  5.15)   -3.96 (  2.17)
TCP_RR          112-threads      1.00 (  2.84)   -0.82 (  2.24)
TCP_RR          168-threads      1.00 (  2.11)   -0.03 (  2.31)
TCP_RR          224-threads      1.00 (  1.76)   +0.01 (  2.12)
TCP_RR          280-threads      1.00 ( 62.46)  +56.56 ( 56.91)
TCP_RR          336-threads      1.00 ( 19.81)   +0.27 ( 17.90)
TCP_RR          392-threads      1.00 ( 30.85)   +0.13 ( 29.09)
TCP_RR          448-threads      1.00 ( 39.71)  -18.82 ( 45.93)
UDP_RR          56-threads       1.00 (  2.08)   -0.31 (  7.89)
UDP_RR          112-threads      1.00 (  3.22)   -0.50 ( 15.19)
UDP_RR          168-threads      1.00 ( 11.77)   +0.37 ( 10.30)
UDP_RR          224-threads      1.00 ( 14.03)   +0.25 ( 12.88)
UDP_RR          280-threads      1.00 ( 16.83)   -0.57 ( 15.34)
UDP_RR          336-threads      1.00 ( 22.57)   +0.01 ( 24.68)
UDP_RR          392-threads      1.00 ( 33.89)   +2.65 ( 33.89)
UDP_RR          448-threads      1.00 ( 44.18)   +0.81 ( 41.28)

Considering the std%, there is not much difference for netperf.

[tbench]
   tbench -t 100 $job 127.0.0.1
job: 56, 112, 168, 224, 280, 336, 392, 448

throughput
==========
case            load            baseline(std%)  compare%( std%)
loopback        56-threads       1.00 (  2.20)   -0.09 (  2.05)
loopback        112-threads      1.00 (  0.29)   -0.88 (  0.10)
loopback        168-threads      1.00 (  0.02)  +62.92 ( 54.57)
loopback        224-threads      1.00 (  0.05) +234.30 (  1.81)
loopback        280-threads      1.00 (  0.08)   -0.11 (  0.21)
loopback        336-threads      1.00 (  0.17)   -0.17 (  0.08)
loopback        392-threads      1.00 (  0.14)   -0.09 (  0.18)
loopback        448-threads      1.00 (  0.24)   -0.53 (  0.55)

There is an improvement for tbench in the 224-threads case.
[hackbench]
hackbench -g $job --$work_type --pipe -l 200000 -s 100 -f 28
and
hackbench -g $job --$work_type -l 200000 -s 100 -f 28
job: 1, 2, 4, 8
work_type: process threads

throughput
==========
case            load            baseline(std%)  compare%( std%)
process-pipe        1-groups     1.00 (  0.20)   +1.57 (  0.58)
process-pipe        2-groups     1.00 (  3.53)   +2.99 (  2.03)
process-pipe        4-groups     1.00 (  1.07)   +0.17 (  1.64)
process-sockets     1-groups     1.00 (  0.36)   -0.04 (  1.44)
process-sockets     2-groups     1.00 (  0.84)   +0.65 (  1.65)
process-sockets     4-groups     1.00 (  0.04)   +0.89 (  0.08)
threads-pipe        1-groups     1.00 (  3.62)   -0.53 (  1.67)
threads-pipe        2-groups     1.00 (  4.17)   -4.79 (  0.53)
threads-pipe        4-groups     1.00 (  5.30)   +5.06 (  1.95)
threads-sockets     1-groups     1.00 (  0.40)   +1.44 (  0.53)
threads-sockets     2-groups     1.00 (  2.54)   +2.21 (  2.51)
threads-sockets     4-groups     1.00 (  0.05)   +1.29 (  0.05)

Not much difference for hackbench.

[schbench(old)]
schbench -m $job -t 56 -r 30
job: 1, 2, 4, 8
3 iterations

99.0th latency
==============
case            load            baseline(std%)  compare%( std%)
normal          1-mthreads       1.00 (  0.56)   -0.91 (  0.32)
normal          2-mthreads       1.00 (  0.95)   -4.05 (  3.63)
normal          4-mthreads       1.00 (  4.04)   -0.30 (  2.35)

Not much difference for schbench.

[Limitation]
In a previous version, Prateek reported a regression. That could be due
to concurrent access across NUMA nodes, or to ILB_UTIL not scanning
hard enough to pull from the busiest group. The former issue is fixed
by not enabling ILB_UTIL for the NUMA domain. If a regression remains
in this version, we can leverage the result of SIS_UTIL and use a
quadratic function rather than the linear one, to scan harder when the
system is idle.

Link: https://raw.githubusercontent.com/chen-yu-surf/tools/master/stress_nanosleep.c #1
Suggested-by: Tim Chen
Signed-off-by: Chen Yu
---
 kernel/sched/fair.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6925813db59b..4e360ed16e14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10195,7 +10195,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
 	unsigned long sum_util = 0;
-	int sg_status = 0;
+	int sg_status = 0, nr_sg_scan;
+	/* only newidle CPU can load the snapshot */
+	bool ilb_can_load = env->idle == CPU_NEWLY_IDLE &&
+			    sd_share && READ_ONCE(sd_share->total_capacity);
+
+	if (sched_feat(ILB_UTIL) && ilb_can_load)
+		nr_sg_scan = sd_share->nr_sg_scan;

 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
@@ -10222,6 +10228,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 			sds->busiest_stat = *sgs;
 		}

+		if (sched_feat(ILB_UTIL) && ilb_can_load && --nr_sg_scan <= 0)
+			goto load_snapshot;
+
 next_group:
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
@@ -10231,6 +10240,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		sg = sg->next;
 	} while (sg != env->sd->groups);

+	ilb_can_load = false;
+
+load_snapshot:
+	if (ilb_can_load) {
+		/* borrow the statistic of previous periodic load balance */
+		sds->total_load = READ_ONCE(sd_share->total_load);
+		sds->total_capacity = READ_ONCE(sd_share->total_capacity);
+	}
+
 	/*
 	 * Indicate that the child domain of the busiest group prefers tasks
 	 * go to a child's sibling domains first. NB the flags of a sched group
-- 
2.25.1


From: Chen Yu
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
    K Prateek Nayak, Gautham R. Shenoy, Chen Yu, Aaron Lu,
    linux-kernel@vger.kernel.org
Subject: [RFC PATCH 6/7] sched/fair: Pull from a relatively busy group during newidle balance
Date: Thu, 27 Jul 2023 22:35:24 +0800

Scanning the whole sched domain to find the busiest group is costly
during newidle_balance() on a high core count system. Introduce
ILB_FAST to lower the bar during the busiest group scan: if the
candidate sched group is already relatively busier than the local
group, terminate the scan and try to pull from that group directly.

Comparing ILB_UTIL and ILB_FAST: the former inhibits the sched group
scan when the system is busy, while the latter settles for a
good-enough busy group when the system is not busy. They complement
each other and work independently.

Tested on top of v6.5-rc2, on Sapphire Rapids with 2 x 56C/112T = 224
CPUs, with the cpufreq governor set to performance and C6 disabled.

First, an extreme synthetic test[1] borrowed from Tianyou, which
launches 224 processes.
Each process is a loop of nanosleep(1 us), intended to trigger newidle
balance frequently:

   i=1;while [ $i -le "224" ]; do ./nano_sleep 1000 & i=$(($i+1)); done;

[ILB_SNAPSHOT + NO_ILB_UTIL + NO_ILB_FAST]
Check the /proc/schedstat delta on CPU8 over 5 seconds using the
following script[2]:
   schedstat.py -i 5 -c 8

Mon Jul 24 23:43:43 2023                     cpu8
.domain0.CPU_IDLE.lb_balanced                843
.domain0.CPU_IDLE.lb_count                   843
.domain0.CPU_IDLE.lb_nobusyg                 843
.domain0.CPU_IDLE.lb_sg_scan                 843
.domain0.CPU_NEWLY_IDLE.lb_balanced          836
.domain0.CPU_NEWLY_IDLE.lb_count             837
.domain0.CPU_NEWLY_IDLE.lb_gained            1
.domain0.CPU_NEWLY_IDLE.lb_imbalance         1
.domain0.CPU_NEWLY_IDLE.lb_nobusyg           836
.domain0.CPU_NEWLY_IDLE.lb_sg_scan           837
.domain1.CPU_IDLE.lb_balanced                41
.domain1.CPU_IDLE.lb_count                   41
.domain1.CPU_IDLE.lb_nobusyg                 39
.domain1.CPU_IDLE.lb_sg_scan                 2145
.domain1.CPU_NEWLY_IDLE.lb_balanced          732    <-----
.domain1.CPU_NEWLY_IDLE.lb_count             822    <-----
.domain1.CPU_NEWLY_IDLE.lb_failed            90
.domain1.CPU_NEWLY_IDLE.lb_imbalance         90
.domain1.CPU_NEWLY_IDLE.lb_nobusyg           497
.domain1.CPU_NEWLY_IDLE.lb_nobusyq           235
.domain1.CPU_NEWLY_IDLE.lb_sg_scan           45210  <-----
.domain1.ttwu_wake_remote                    626
.domain2.CPU_IDLE.lb_balanced                15
.domain2.CPU_IDLE.lb_count                   15
.domain2.CPU_NEWLY_IDLE.lb_balanced          635
.domain2.CPU_NEWLY_IDLE.lb_count             655
.domain2.CPU_NEWLY_IDLE.lb_failed            20
.domain2.CPU_NEWLY_IDLE.lb_imbalance         40
.domain2.CPU_NEWLY_IDLE.lb_nobusyg           633
.domain2.CPU_NEWLY_IDLE.lb_nobusyq           2
.domain2.CPU_NEWLY_IDLE.lb_sg_scan           655
.stats.rq_cpu_time                           227910772
.stats.rq_sched_info.pcount                  89393
.stats.rq_sched_info.run_delay               2145671
.stats.sched_count                           178783
.stats.sched_goidle                          89390
.stats.ttwu_count                            89392
.stats.ttwu_local                            88766

For domain1, there are 822 newidle balance attempts, and the total
number of groups scanned is 45210, so each balance scans about 55
groups.
During these 822 balance attempts, 732 end up (or already are)
balanced, so the effective balance success ratio is
(822 - 732) / 822 = 10.94%.

The perf profile:
   9.38%  0.45%  [kernel.kallsyms]  [k] newidle_balance
   6.84%  5.32%  [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0

[ILB_SNAPSHOT + NO_ILB_UTIL + ILB_FAST]
Mon Jul 24 23:43:50 2023                     cpu8
.domain0.CPU_IDLE.lb_balanced                918
.domain0.CPU_IDLE.lb_count                   918
.domain0.CPU_IDLE.lb_nobusyg                 918
.domain0.CPU_IDLE.lb_sg_scan                 918
.domain0.CPU_NEWLY_IDLE.lb_balanced          1536
.domain0.CPU_NEWLY_IDLE.lb_count             1545
.domain0.CPU_NEWLY_IDLE.lb_failed            1
.domain0.CPU_NEWLY_IDLE.lb_gained            8
.domain0.CPU_NEWLY_IDLE.lb_imbalance         9
.domain0.CPU_NEWLY_IDLE.lb_nobusyg           1536
.domain0.CPU_NEWLY_IDLE.lb_sg_scan           1545
.domain1.CPU_IDLE.lb_balanced                45
.domain1.CPU_IDLE.lb_count                   45
.domain1.CPU_IDLE.lb_nobusyg                 43
.domain1.CPU_IDLE.lb_sg_scan                 2365
.domain1.CPU_NEWLY_IDLE.lb_balanced          1196   <------
.domain1.CPU_NEWLY_IDLE.lb_count             1496   <------
.domain1.CPU_NEWLY_IDLE.lb_failed            296
.domain1.CPU_NEWLY_IDLE.lb_gained            4
.domain1.CPU_NEWLY_IDLE.lb_imbalance         301
.domain1.CPU_NEWLY_IDLE.lb_nobusyg           1182
.domain1.CPU_NEWLY_IDLE.lb_nobusyq           14
.domain1.CPU_NEWLY_IDLE.lb_sg_scan           30127  <------
.domain1.ttwu_wake_remote                    2688
.domain2.CPU_IDLE.lb_balanced                13
.domain2.CPU_IDLE.lb_count                   13
.domain2.CPU_NEWLY_IDLE.lb_balanced          898
.domain2.CPU_NEWLY_IDLE.lb_count             904
.domain2.CPU_NEWLY_IDLE.lb_failed            6
.domain2.CPU_NEWLY_IDLE.lb_imbalance         11
.domain2.CPU_NEWLY_IDLE.lb_nobusyg           896
.domain2.CPU_NEWLY_IDLE.lb_nobusyq           2
.domain2.CPU_NEWLY_IDLE.lb_sg_scan           904
.stats.rq_cpu_time                           239830575
.stats.rq_sched_info.pcount                  90879
.stats.rq_sched_info.run_delay               2436461
.stats.sched_count                           181732
.stats.sched_goidle                          90853
.stats.ttwu_count                            90880
.stats.ttwu_local                            88192

With ILB_FAST enabled, the CPU_NEWLY_IDLE balance count in domain1 on
CPU8 is 1496 and the total number of groups scanned is 30127, i.e. each
load balance scans about 20 groups, well below the 56 groups in the
domain. During these 1496 balance attempts, 1196 are balanced, so the
effective balance success ratio is (1496 - 1196) / 1496 = 20.05%,
higher than the 10.94% with ILB_FAST disabled.

perf profile:
   2.95%  0.38%  [kernel.kallsyms]  [k] newidle_balance
   2.00%  1.51%  [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0

With ILB_FAST enabled, the total time in update_sd_lb_stats() has
dropped a lot. More benchmark results are shown below.
Baseline is ILB_SNAPSHOT + NO_ILB_UTIL, compared against
ILB_SNAPSHOT + NO_ILB_UTIL + ILB_FAST.

[netperf]
Launches nr instances of:
   netperf -4 -H 127.0.0.1 -t $work_mode -c -C -l 100 &
nr: 56, 112, 168, 224, 280, 336, 392, 448
work_mode: TCP_RR UDP_RR

throughput
==========
case            load            baseline(std%)  compare%( std%)
TCP_RR          56-threads       1.00 (  1.83)   +4.25 (  5.15)
TCP_RR          112-threads      1.00 (  2.19)   +0.96 (  2.84)
TCP_RR          168-threads      1.00 (  1.92)   -0.04 (  2.11)
TCP_RR          224-threads      1.00 (  1.98)   -0.03 (  1.76)
TCP_RR          280-threads      1.00 ( 63.11)   -7.59 ( 62.46)
TCP_RR          336-threads      1.00 ( 18.44)   -0.45 ( 19.81)
TCP_RR          392-threads      1.00 ( 26.49)   -0.09 ( 30.85)
TCP_RR          448-threads      1.00 ( 40.47)   -0.28 ( 39.71)
UDP_RR          56-threads       1.00 (  1.83)   -0.31 (  2.08)
UDP_RR          112-threads      1.00 ( 13.77)   +3.58 (  3.22)
UDP_RR          168-threads      1.00 ( 10.97)   -0.08 ( 11.77)
UDP_RR          224-threads      1.00 ( 12.83)   -0.04 ( 14.03)
UDP_RR          280-threads      1.00 ( 13.89)   +0.35 ( 16.83)
UDP_RR          336-threads      1.00 ( 24.91)   +1.38 ( 22.57)
UDP_RR          392-threads      1.00 ( 34.86)   -0.91 ( 33.89)
UDP_RR          448-threads      1.00 ( 40.63)   +0.70 ( 44.18)

[tbench]
   tbench -t 100 $job 127.0.0.1
job: 56, 112, 168, 224, 280, 336, 392, 448

throughput
==========
case            load            baseline(std%)  compare%( std%)
loopback        56-threads       1.00 (  0.89)   +1.51 (  2.20)
loopback        112-threads      1.00 (  0.03)   +1.15 (  0.29)
loopback        168-threads      1.00 ( 53.55)  -37.92 (  0.02)
loopback        224-threads      1.00 ( 61.24)  -43.18 (  0.01)
loopback        280-threads      1.00 (  0.04)   +0.33 (  0.08)
loopback        336-threads      1.00 (  0.35)   +0.40 (  0.17)
loopback        392-threads      1.00 (  0.61)   +0.49 (  0.14)
loopback        448-threads      1.00 (  0.08)   +0.01 (  0.24)

[schbench]
schbench -m $job -t 56 -r 30
job: 1, 2, 4, 8
3 iterations

99.0th latency
==============
case            load            baseline(std%)  compare%( std%)
normal          1-mthreads       1.00 (  0.56)   -0.45 (  0.32)
normal          2-mthreads       1.00 (  0.95)   +1.01 (  3.45)
normal          4-mthreads       1.00 (  4.04)   -0.60 (  1.26)

[hackbench]
hackbench -g $job --$work_type --pipe -l 200000 -s 100 -f 28
and
hackbench -g $job --$work_type -l 200000 -s 100 -f 28
job: 1, 2, 4, 8
work_type: process threads

throughput
==========
case            load            baseline(std%)  compare%( std%)
process-pipe        1-groups     1.00 (  0.20)   +2.30 (  0.26)
process-pipe        2-groups     1.00 (  3.53)   +6.14 (  2.45)
process-pipe        4-groups     1.00 (  1.07)   -4.58 (  2.58)
process-sockets     1-groups     1.00 (  0.36)   +0.75 (  1.22)
process-sockets     2-groups     1.00 (  0.84)   +1.26 (  1.11)
process-sockets     4-groups     1.00 (  0.04)   +0.97 (  0.11)
threads-pipe        1-groups     1.00 (  3.62)   +3.22 (  2.64)
threads-pipe        2-groups     1.00 (  4.17)   +5.85 (  7.53)
threads-pipe        4-groups     1.00 (  5.30)   -4.14 (  5.39)
threads-sockets     1-groups     1.00 (  0.40)   +3.50 (  3.13)
threads-sockets     2-groups     1.00 (  2.54)   +1.79 (  0.80)
threads-sockets     4-groups     1.00 (  0.05)   +1.33 (  0.03)

Considering the std%, there is not much score difference, which
probably indicates that ILB_FAST reduces the cost of newidle balance
without hurting performance.
Link: https://raw.githubusercontent.com/chen-yu-surf/tools/master/stress_nanosleep.c #1
Link: https://raw.githubusercontent.com/chen-yu-surf/tools/master/schedstat.py #2
Suggested-by: Tim Chen
Signed-off-by: Chen Yu
---
 kernel/sched/fair.c     | 37 +++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  1 +
 2 files changed, 38 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4e360ed16e14..9af57b5a24dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10182,6 +10182,36 @@ static void update_ilb_group_scan(struct lb_env *env,
 		WRITE_ONCE(sd_share->nr_sg_scan, (int)nr_scan);
 }

+static bool can_pull_busiest(struct sg_lb_stats *local,
+			     struct sg_lb_stats *busiest)
+{
+	/*
+	 * Check if the local group can pull from the 'busiest'
+	 * group directly. When reaching here, update_sd_pick_busiest()
+	 * has already filtered a candidate.
+	 * The scan in newidle load balance on a high core count system
+	 * is costly, thus provide this shortcut to find a relatively busy
+	 * group rather than the busiest one.
+	 *
+	 * Only enable this shortcut when the local group is quite
+	 * idle. This is because the total cost of newidle_balance()
+	 * becomes severe when multiple CPUs fall into idle and launch
+	 * newidle_balance() concurrently. And that usually indicates
+	 * a group_has_spare status.
+	 */
+	if (local->group_type != group_has_spare)
+		return false;
+
+	if (busiest->idle_cpus > local->idle_cpus)
+		return false;
+
+	if (busiest->idle_cpus == local->idle_cpus &&
+	    busiest->sum_nr_running <= local->sum_nr_running)
+		return false;
+
+	return true;
+}
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -10226,6 +10256,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
 			sds->busiest_stat = *sgs;
+			/*
+			 * Check if this busiest group can be pulled by the
+			 * local group directly.
+			 */
+			if (sched_feat(ILB_FAST) && ilb_can_load &&
+			    can_pull_busiest(local, sgs))
+				goto load_snapshot;
 		}

 		if (sched_feat(ILB_UTIL) && ilb_can_load && --nr_sg_scan <= 0)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 30f6d1a2f235..4d67e0abb78c 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -104,3 +104,4 @@ SCHED_FEAT(BASE_SLICE, true)

 SCHED_FEAT(ILB_SNAPSHOT, true)
 SCHED_FEAT(ILB_UTIL, true)
+SCHED_FEAT(ILB_FAST, true)
-- 
2.25.1


From: Chen Yu
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
    K Prateek Nayak, Gautham R. Shenoy, Chen Yu, Aaron Lu,
    linux-kernel@vger.kernel.org
Subject: [RFC PATCH 7/7] sched/stats: Track the number of groups scanned during load balance
Date: Thu, 27 Jul 2023 22:35:36 +0800

Record in schedstat how many sched groups each load balance attempt
scans. This metric can be used to evaluate load balance cost and
efficiency.
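For example (an illustrative userspace sketch; lb_sg_scan is appended
as the 9th per-idle-type field by the show_schedstat() change below),
the average scan depth per balance attempt falls out of two fields:

#include <stdio.h>

int main(void)
{
    /* figures from the domain1 CPU_NEWLY_IDLE snippet in the previous patch */
    unsigned long lb_count = 822, lb_sg_scan = 45210;

    printf("avg groups scanned per balance: %lu\n",
           lb_sg_scan / lb_count);  /* prints 55 */
    return 0;
}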
Signed-off-by: Chen Yu
---
 include/linux/sched/topology.h | 1 +
 kernel/sched/fair.c            | 2 ++
 kernel/sched/stats.c           | 5 +++--
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index af2261308529..fa8fc6a497fd 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -124,6 +124,7 @@ struct sched_domain {
 	unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_nobusyq[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_sg_scan[CPU_MAX_IDLE_TYPES];

 	/* Active load balancing */
 	unsigned int alb_count;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9af57b5a24dc..96df7c5706d1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10253,6 +10253,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 			goto next_group;


+		schedstat_inc(env->sd->lb_sg_scan[env->idle]);
+
 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
 			sds->busiest_stat = *sgs;
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..38608f791363 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -152,7 +152,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
 				   cpumask_pr_args(sched_domain_span(sd)));
 			for (itype = CPU_IDLE; itype < CPU_MAX_IDLE_TYPES;
 			     itype++) {
-				seq_printf(seq, " %u %u %u %u %u %u %u %u",
+				seq_printf(seq, " %u %u %u %u %u %u %u %u %u",
 					   sd->lb_count[itype],
 					   sd->lb_balanced[itype],
 					   sd->lb_failed[itype],
@@ -160,7 +160,8 @@ static int show_schedstat(struct seq_file *seq, void *v)
 					   sd->lb_gained[itype],
 					   sd->lb_hot_gained[itype],
 					   sd->lb_nobusyq[itype],
-					   sd->lb_nobusyg[itype]);
+					   sd->lb_nobusyg[itype],
+					   sd->lb_sg_scan[itype]);
 			}
 			seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u %u\n",
-- 
2.25.1