From: Aaron Lu
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Daniel Jordan
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, linux-kernel@vger.kernel.org
Subject: [PATCH 1/4] sched/fair: free allocated memory on error in alloc_fair_sched_group()
Date: Tue, 18 Jul 2023 21:41:17 +0800
Message-ID: <20230718134120.81199-2-aaron.lu@intel.com>
In-Reply-To: <20230718134120.81199-1-aaron.lu@intel.com>

There is one struct cfs_rq and one struct sched_entity per CPU for each
task group. When the allocation of tg->cfs_rq[X] fails, the already
allocated tg->cfs_rq[0]..tg->cfs_rq[X-1] should be freed. The same
applies to tg->se.
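The cleanup pattern the patch adopts can be sketched in plain userspace
C. Everything below is a hypothetical stand-in, not kernel code:
demo_group and per_cpu_a/per_cpu_b model the tg->cfs_rq[]/tg->se[]
arrays, and calloc()/free() model kcalloc()/kzalloc_node()/kfree():

#include <stdlib.h>

/* Hypothetical stand-ins for tg->cfs_rq[] and tg->se[]. */
struct demo_group {
	int **per_cpu_a;
	int **per_cpu_b;
};

/* Returns 1 on success, 0 on failure; nothing leaks on failure. */
static int demo_alloc(struct demo_group *g, int nr_cpus)
{
	int i;

	g->per_cpu_a = calloc(nr_cpus, sizeof(*g->per_cpu_a));
	if (!g->per_cpu_a)
		return 0;
	g->per_cpu_b = calloc(nr_cpus, sizeof(*g->per_cpu_b));
	if (!g->per_cpu_b)
		goto err_free_a;

	for (i = 0; i < nr_cpus; i++) {
		g->per_cpu_a[i] = calloc(1, sizeof(int));
		if (!g->per_cpu_a[i])
			goto err_free_elems;
		g->per_cpu_b[i] = calloc(1, sizeof(int));
		if (!g->per_cpu_b[i])
			goto err_free_elems;
	}
	return 1;

err_free_elems:
	/*
	 * Walk from index 0 and free whatever was allocated before the
	 * failure; the first slot where both pointers are still NULL
	 * marks the point the allocation loop never reached.
	 */
	for (i = 0; i < nr_cpus; i++) {
		free(g->per_cpu_a[i]);
		free(g->per_cpu_b[i]);
		if (!g->per_cpu_a[i] && !g->per_cpu_b[i])
			break;
	}
	free(g->per_cpu_b);
err_free_a:
	free(g->per_cpu_a);
	return 0;
}

Because calloc() zeroes the pointer arrays, every slot past the failure
point is NULL, and free(NULL) is a no-op, so the single loop handles
both the half-allocated slot and the untouched slots. That is the same
reasoning the patch relies on with kcalloc() and kfree().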
Signed-off-by: Aaron Lu
---
 kernel/sched/fair.c | 23 ++++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..0f913487928d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12443,10 +12443,10 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 
 	tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
 	if (!tg->cfs_rq)
-		goto err;
+		return 0;
 	tg->se = kcalloc(nr_cpu_ids, sizeof(se), GFP_KERNEL);
 	if (!tg->se)
-		goto err;
+		goto err_free_rq_pointer;
 
 	tg->shares = NICE_0_LOAD;
 
@@ -12456,12 +12456,12 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
				      GFP_KERNEL, cpu_to_node(i));
 		if (!cfs_rq)
-			goto err;
+			goto err_free;
 
 		se = kzalloc_node(sizeof(struct sched_entity_stats),
				  GFP_KERNEL, cpu_to_node(i));
 		if (!se)
-			goto err_free_rq;
+			goto err_free;
 
 		init_cfs_rq(cfs_rq);
 		init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
@@ -12470,9 +12470,18 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 
 	return 1;
 
-err_free_rq:
-	kfree(cfs_rq);
-err:
+err_free:
+	for_each_possible_cpu(i) {
+		kfree(tg->cfs_rq[i]);
+		kfree(tg->se[i]);
+
+		if (!tg->cfs_rq[i] && !tg->se[i])
+			break;
+	}
+	kfree(tg->se);
+err_free_rq_pointer:
+	kfree(tg->cfs_rq);
+
 	return 0;
 }
 
-- 
2.41.0

From: Aaron Lu
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Daniel Jordan
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 2/4] sched/fair: Make tg->load_avg per node
Date: Tue, 18 Jul 2023 21:41:18 +0800
Message-ID: <20230718134120.81199-3-aaron.lu@intel.com>
In-Reply-To: <20230718134120.81199-1-aaron.lu@intel.com>

When using sysbench to benchmark Postgres in a single docker instance
with sysbench's nr_threads set to nr_cpu, update_cfs_group() and
update_load_avg() are at times observed to show noticeable overhead on
a 2-socket/112-core/224-CPU Intel Sapphire Rapids (SPR):

    13.75%    13.74%  [kernel.vmlinux]    [k] update_cfs_group
    10.63%    10.04%  [kernel.vmlinux]    [k] update_load_avg

Annotation shows the cycles are mostly spent on accessing tg->load_avg,
with update_load_avg() being the write side and update_cfs_group()
being the read side. Tim Chen told me that PeterZ once mentioned a way
to solve a similar problem by making a counter per node, so do the same
for tg->load_avg.
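The core idea is a sharded counter: writers only touch their local
node's shard and readers sum all shards. A minimal userspace sketch
using C11 atomics follows; NR_NODES and all names here are purely
illustrative (the kernel code instead keeps one atomic_long_t per node
in tg->node_info[] and sums them in tg_load_avg()):

#include <stdatomic.h>

#define NR_NODES 2	/* illustrative stand-in for num_possible_nodes() */

/*
 * One counter shard per node, each on its own cache line so that
 * writers on different sockets do not bounce the same line around
 * (this mirrors what ____cacheline_aligned_in_smp achieves).
 */
struct node_counter {
	_Alignas(64) atomic_long load_avg;
};

static struct node_counter shard[NR_NODES];

/* Write side: touch only the local node's shard. */
static void counter_add(int node, long delta)
{
	atomic_fetch_add_explicit(&shard[node].load_avg, delta,
				  memory_order_relaxed);
}

/*
 * Read side: sum all shards. The result is a slightly racy snapshot,
 * which load tracking tolerates.
 */
static long counter_read(void)
{
	long sum = 0;

	for (int i = 0; i < NR_NODES; i++)
		sum += atomic_load_explicit(&shard[i].load_avg,
					    memory_order_relaxed);
	return sum;
}

The trade-off is that a read now touches one cache line per node
instead of one global line, but each of those lines is only ever
dirtied by writers on its own node, so cross-socket cacheline bouncing
drops sharply.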
After this change, the cost of the two functions is reduced and
sysbench transactions are increased on SPR. Below are test results.

==================================================
postgres_sysbench (transactions, higher is better)
nr_thread=100%/75%/50% were tested on 2-socket SPR and Ice Lake and
results that have a measurable difference are:

nr_thread=100% on SPR:
base:   90569.11±1.15%
node:  104152.26±0.34%  +15.0%

nr_thread=75% on SPR:
base:  100803.96±0.57%
node:  107333.58±0.44%   +6.5%

==================================================
hackbench/pipe/threads/fd=20/loop=1000000 (throughput, higher is better)
group=1/4/8/16 were tested on 2-socket SPR and Cascade Lake and the
results that have a measurable difference are:

group=8 on SPR:
base:  437163±2.6%
node:  471203±1.2%   +7.8%

group=16 on SPR:
base:  468279±1.9%
node:  580385±1.7%  +23.9%

==================================================
netperf/TCP_STREAM
nr_thread=1/25%/50%/75%/100% were tested on 2-socket SPR and Cascade
Lake and there is no measurable difference.

==================================================
netperf/UDP_RR (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2-socket SPR and Cascade
Lake and results that have a measurable difference are:

nr_thread=75% on Cascade Lake:
base:  36701±1.7%
node:  39949±1.4%   +8.8%

nr_thread=75% on SPR:
base:  14249±3.8%
node:  19890±2.0%  +39.6%

nr_thread=100% on Cascade Lake:
base:  52275±0.6%
node:  53827±0.4%   +3.0%

nr_thread=100% on SPR:
base:   9560±1.6%
node:  14186±3.9%  +48.4%

Reported-by: Nitin Tekchandani
Signed-off-by: Aaron Lu
---
 kernel/sched/debug.c |  2 +-
 kernel/sched/fair.c  | 29 ++++++++++++++++++++++++++---
 kernel/sched/sched.h | 43 +++++++++++++++++++++++++++++++++----------
 3 files changed, 60 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 066ff1c8ae4e..3af965a18866 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -691,7 +691,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
			cfs_rq->tg_load_avg_contrib);
 	SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_avg",
-			atomic_long_read(&cfs_rq->tg->load_avg));
+			tg_load_avg(cfs_rq->tg));
 #endif
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f913487928d..aceb8f5922cb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3496,7 +3496,7 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
 
 	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
 
-	tg_weight = atomic_long_read(&tg->load_avg);
+	tg_weight = tg_load_avg(tg);
 
 	/* Ensure tg_weight >= load */
 	tg_weight -= cfs_rq->tg_load_avg_contrib;
@@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
 	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+	int node = cpu_to_node(smp_processor_id());
 
 	/*
 	 * No need to update load_avg for root_task_group as it is not used.
@@ -3673,7 +3674,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 		return;
 
 	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
-		atomic_long_add(delta, &cfs_rq->tg->load_avg);
+		atomic_long_add(delta, &cfs_rq->tg->node_info[node]->load_avg);
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
 	}
 }
@@ -12439,7 +12440,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 {
 	struct sched_entity *se;
 	struct cfs_rq *cfs_rq;
-	int i;
+	int i, nodes;
 
 	tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
 	if (!tg->cfs_rq)
@@ -12468,8 +12469,30 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 		init_entity_runnable_average(se);
 	}
 
+#ifdef CONFIG_SMP
+	nodes = num_possible_nodes();
+	tg->node_info = kcalloc(nodes, sizeof(struct tg_node_info *), GFP_KERNEL);
+	if (!tg->node_info)
+		goto err_free;
+
+	for_each_node(i) {
+		tg->node_info[i] = kzalloc_node(sizeof(struct tg_node_info), GFP_KERNEL, i);
+		if (!tg->node_info[i])
+			goto err_free_node;
+	}
+#endif
+
 	return 1;
 
+#ifdef CONFIG_SMP
+err_free_node:
+	for_each_node(i) {
+		kfree(tg->node_info[i]);
+		if (!tg->node_info[i])
+			break;
+	}
+	kfree(tg->node_info);
+#endif
 err_free:
 	for_each_possible_cpu(i) {
 		kfree(tg->cfs_rq[i]);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14dfaafb3a8f..9cece2dbc95b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -359,6 +359,17 @@ struct cfs_bandwidth {
 #endif
 };
 
+struct tg_node_info {
+	/*
+	 * load_avg can be heavily contended at clock tick time and task
+	 * enqueue/dequeue time, so put it in its own cacheline separated
+	 * from other fields.
+	 */
+	struct {
+		atomic_long_t load_avg;
+	} ____cacheline_aligned_in_smp;
+};
+
 /* Task group related information */
 struct task_group {
 	struct cgroup_subsys_state css;
@@ -373,15 +384,8 @@ struct task_group {
 	/* A positive value indicates that this is a SCHED_IDLE group. */
 	int			idle;
 
-#ifdef CONFIG_SMP
-	/*
-	 * load_avg can be heavily contended at clock tick time, so put
-	 * it in its own cacheline separated from the fields above which
-	 * will also be accessed at each tick.
-	 */
-	struct {
-		atomic_long_t load_avg;
-	} ____cacheline_aligned_in_smp;
+#ifdef CONFIG_SMP
+	struct tg_node_info	**node_info;
 #endif
 #endif
 
@@ -413,9 +417,28 @@ struct task_group {
 	/* Effective clamp values used for a task group */
 	struct uclamp_se	uclamp[UCLAMP_CNT];
 #endif
-
 };
 
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+static inline long tg_load_avg(struct task_group *tg)
+{
+	long load_avg = 0;
+	int i;
+
+	/*
+	 * The only path that can give us a root_task_group
+	 * here is from print_cfs_rq() thus unlikely.
+	 */
+	if (unlikely(tg == &root_task_group))
+		return 0;
+
+	for_each_node(i)
+		load_avg += atomic_long_read(&tg->node_info[i]->load_avg);
+
+	return load_avg;
+}
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #define ROOT_TASK_GROUP_LOAD	NICE_0_LOAD
 
-- 
2.41.0

From: Aaron Lu
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Daniel Jordan
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load
Date: Tue, 18 Jul 2023 21:41:19 +0800
Message-ID: <20230718134120.81199-4-aaron.lu@intel.com>
In-Reply-To: <20230718134120.81199-1-aaron.lu@intel.com>

When a workload involves many wake-time task migrations, tg->load_avg
can be heavily contended among CPUs because every migration involves
removing the task's load from its src cfs_rq and attaching that load
to its new cfs_rq. Both the remove and the attach require an update
to tg->load_avg as well as propagating the change up the hierarchy.
For example, when running postgres_sysbench on a
2-socket/112-core/224-CPU Intel Sapphire Rapids, during a 5s window the
wakeup count is 14 million and the migration count is 11 million. Since
the workload can trigger that many wakeups and migrations, the accesses
(both read and write) to tg->load_avg are essentially unbounded. For
this workload, the profile shows update_cfs_group() costs ~13% and
update_load_avg() costs ~10%. With netperf/nr_client=nr_cpu/UDP_RR, the
wakeup count is 21 million and the migration count is 14 million;
update_cfs_group() costs ~25% and update_load_avg() costs ~16%.

This patch is an attempt to reduce the cost of accessing tg->load_avg.
The current logic immediately calls update_tg_load_avg() when a cfs_rq
has removed load; this patch changes that behavior: when
update_cfs_rq_load_avg() discovers that this cfs_rq has removed load,
it does not call update_tg_load_avg() or propagate the removed load
immediately; instead, the update to tg->load_avg and the propagation
are dealt with by a following event, such as a task being attached to
this cfs_rq, or in update_blocked_averages(). This way, the calls to
update_tg_load_avg() for this cfs_rq and its ancestors are reduced by
about half.
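Reduced to its essence, this is delta batching: park the removed-load
delta in a per-queue field and fold it into the next write that must
touch the shared counter anyway. A minimal userspace sketch with
illustrative names (prop_removed_sum mirrors the new cfs_rq field;
shared_load stands in for tg->load_avg):

#include <stdatomic.h>

static atomic_long shared_load;		/* stands in for tg->load_avg */

struct queue {
	long prop_removed_sum;		/* parked, not yet propagated */
};

/* Remove path: only accumulate; no shared-counter write here. */
static void remove_load(struct queue *q, long removed)
{
	q->prop_removed_sum -= removed;
}

/*
 * Any later event that has to update the shared counter anyway
 * flushes the parked sum in the same write, so one shared-cacheline
 * access does the work that previously took two.
 */
static void update_shared(struct queue *q, long delta)
{
	delta += q->prop_removed_sum;
	q->prop_removed_sum = 0;
	atomic_fetch_add_explicit(&shared_load, delta,
				  memory_order_relaxed);
}

The cost of the deferral is staleness between the remove and the flush,
which is why the patch also treats a pending prop_removed_sum as "not
decayed" so the cfs_rq stays on the leaf list and
update_blocked_averages() eventually flushes it.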
==================================================
postgres_sysbench (transactions, higher is better)
nr_thread=100%/75%/50% were tested on 2-socket SPR and Ice Lake and
results that have a measurable difference are:

nr_thread=100% on SPR:
base:    90569.11±1.15%
node:   104152.26±0.34%  +15.0%
delay:  127309.46±4.25%  +40.6%

nr_thread=75% on SPR:
base:   100803.96±0.57%
node:   107333.58±0.44%   +6.5%
delay:  124332.39±0.51%  +23.3%

nr_thread=75% on ICL:
base:   61961.26±0.41%
node:   61585.45±0.50%
delay:  72420.52±0.14%  +16.9%

==================================================
hackbench/pipe/threads/fd=20/loop=1000000 (throughput, higher is better)
group=1/4/8/16 were tested on 2-socket SPR and Cascade Lake and the
results that have a measurable difference are:

group=8 on SPR:
base:   437163±2.6%
node:   471203±1.2%   +7.8%
delay:  490780±0.9%  +12.3%

group=16 on SPR:
base:   468279±1.9%
node:   580385±1.7%  +23.9%
delay:  664422±0.2%  +41.9%

==================================================
netperf/TCP_STREAM (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2-socket SPR and Cascade
Lake and results that have a measurable difference are:

nr_thread=50% on CSL:
base:   16258±0.7%
node:   16172±2.9%
delay:  17729±0.7%   +9.0%

nr_thread=75% on CSL:
base:   12923±1.2%
node:   13011±2.2%
delay:  15452±1.6%  +19.6%

nr_thread=75% on SPR:
base:   16232±11.9%
node:   13962±5.1%
delay:  21089±0.8%  +29.9%

nr_thread=100% on SPR:
base:   13220±0.6%
node:   13113±0.0%
delay:  18258±11.3%  +38.1%

==================================================
netperf/UDP_RR (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2-socket SPR and Cascade
Lake and results that have a measurable difference are:

nr_thread=1 on CSL:
base:   128521±0.5%
node:   127935±0.6%
delay:  126317±0.4%   -1.7%

nr_thread=75% on CSL:
base:   36701±1.7%
node:   39949±1.4%   +8.8%
delay:  42516±0.3%  +15.8%

nr_thread=75% on SPR:
base:   14249±3.8%
node:   19890±2.0%   +39.6%
delay:  31331±0.5%  +119.9%

nr_thread=100% on CSL:
base:   52275±0.6%
node:   53827±0.4%   +3.0%
delay:  78386±0.7%  +49.9%

nr_thread=100% on SPR:
base:    9560±1.6%
node:   14186±3.9%   +48.4%
delay:  20779±2.8%  +117.4%

Signed-off-by: Aaron Lu
---
 kernel/sched/fair.c  | 23 ++++++++++++++++++-----
 kernel/sched/sched.h |  1 +
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aceb8f5922cb..564ffe3e59c1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3645,6 +3645,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (child_cfs_rq_on_list(cfs_rq))
 		return false;
 
+	if (cfs_rq->prop_removed_sum)
+		return false;
+
 	return true;
 }
 
@@ -3911,6 +3914,11 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
 {
 	cfs_rq->propagate = 1;
 	cfs_rq->prop_runnable_sum += runnable_sum;
+
+	if (cfs_rq->prop_removed_sum) {
+		cfs_rq->prop_runnable_sum += cfs_rq->prop_removed_sum;
+		cfs_rq->prop_removed_sum = 0;
+	}
 }
 
 /* Update task and its cfs_rq load average */
@@ -4133,13 +4141,11 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		 * removed_runnable is the unweighted version of removed_load so we
 		 * can use it to estimate removed_load_sum.
 		 */
-		add_tg_cfs_propagate(cfs_rq,
-			-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
-
-		decayed = 1;
+		cfs_rq->prop_removed_sum +=
+			-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT;
 	}
 
-	decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
+	decayed = __update_load_avg_cfs_rq(now, cfs_rq);
 	u64_u32_store_copy(sa->last_update_time,
			   cfs_rq->last_update_time_copy,
			   sa->last_update_time);
@@ -9001,6 +9007,13 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
 
 		if (cfs_rq == &rq->cfs)
 			decayed = true;
+
+		/*
+		 * If the aggregated removed_sum hasn't been taken care of,
+		 * deal with it now before this cfs_rq is removed from the list.
+		 */
+		if (cfs_rq->prop_removed_sum)
+			add_tg_cfs_propagate(cfs_rq, 0);
 	}
 
 	/* Propagate pending load changes to the parent, if any: */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9cece2dbc95b..ab540b21d071 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -619,6 +619,7 @@ struct cfs_rq {
 	unsigned long		tg_load_avg_contrib;
 	long			propagate;
 	long			prop_runnable_sum;
+	long			prop_removed_sum;
 
 	/*
	 * h_load = weight * f(tg)
-- 
2.41.0

From: Aaron Lu
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Daniel Jordan
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 4/4] sched/fair: skip some update_cfs_group() on en/dequeue_entity()
Date: Tue, 18 Jul 2023 21:41:20 +0800
Message-ID: <20230718134120.81199-5-aaron.lu@intel.com>
In-Reply-To: <20230718134120.81199-1-aaron.lu@intel.com>

After the previous patch, the cost of update_cfs_group() and
update_load_avg() has dropped to around 1% for postgres_sysbench on
SPR, but netperf/UDP_RR on SPR still sees ~20% update_cfs_group() and
~10% update_load_avg(), so this patch is another attempt to further
reduce the two functions' cost from the read side.

The observation is: if an entity is dequeued, updating its weight is
not useful, except that the current code also updates its cfs_rq's
load_avg using the updated weight. Removing update_cfs_group() from the
dequeue path therefore reduces the cost of accessing tg->load_avg, at
the price of some load-tracking accuracy.

Another hint comes from the ancient commit 17bc14b767cf ("Revert
"sched: Update_cfs_shares at period edge""): if an entity is enqueued
and it is the only entity on its cfs_rq, its weight does not need to be
updated immediately, since the weight is only needed to decide whether
the entity can preempt curr.

Commit 17bc14b767cf mentioned a latency problem when the calling
frequency of update_cfs_group() was reduced: doing a "make -j32" in one
terminal window made the browsing experience worse. To see how things
are now, I ran a test: two cgroups were created under root; in one
group I ran "make -j32" and in the meantime ran
"./schbench -m 1 -t 6 -r 300" in the other group, on a 6-core/12-CPU
Intel i7-8700T Coffee Lake CPU. The wakeup latency reported by schbench
for base and for this series doesn't look much different:

base:
schbench -m 1 -t 6 -r 300:
Latency percentiles (usec) runtime 300 (s) (18534 total samples)
        50.0th: 20 (9491 samples)
        75.0th: 25 (4768 samples)
        90.0th: 29 (2552 samples)
        95.0th: 62 (809 samples)
        *99.0th: 20320 (730 samples)
        99.5th: 23392 (92 samples)
        99.9th: 31392 (74 samples)
        min=6, max=32032

make -j32:
real    5m35.950s
user    47m33.814s
sys     4m45.470s

this series:
schbench -m 1 -t 6 -r 300:
Latency percentiles (usec) runtime 300 (s) (18528 total samples)
        50.0th: 21 (9920 samples)
        75.0th: 26 (4756 samples)
        90.0th: 30 (2100 samples)
        95.0th: 63 (846 samples)
        *99.0th: 19040 (722 samples)
        99.5th: 21920 (92 samples)
        99.9th: 30048 (81 samples)
        min=6, max=34873

make -j32:
real    5m35.185s
user    47m28.528s
sys     4m44.705s

As for netperf/UDP_RR/nr_thread=100% on SPR: after this change, the two
functions' cost dropped to ~2%.
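Reduced to a sketch with hypothetical names (queue and entity stand in
for cfs_rq and sched_entity, update_group_weight() for
update_cfs_group()), the resulting enqueue/dequeue pattern is:

struct entity { long weight; };
struct queue  { int nr_running; };

/*
 * Stands in for update_cfs_group(): expensive because it reads the
 * shared tg->load_avg to recompute the group entity's weight. The
 * body is elided in this sketch.
 */
static void update_group_weight(struct queue *q, struct entity *se)
{
	(void)q;
	(void)se;
}

static void enqueue(struct queue *q, struct entity *se)
{
	/*
	 * Only recompute the weight when another entity is already
	 * queued: with nobody to preempt, an up-to-date weight is not
	 * needed yet and a later update will correct it.
	 */
	if (q->nr_running > 0)
		update_group_weight(q, se);
	q->nr_running++;
}

static void dequeue(struct queue *q, struct entity *se)
{
	(void)se;
	/* No weight update: @se is leaving, so the result is unused. */
	q->nr_running--;
}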
Other test results:

==================================================
postgres_sysbench (transactions, higher is better)
nr_thread=100%/75%/50% were tested on 2-socket SPR and Ice Lake and
results that have a measurable difference are:

nr_thread=100% on SPR:
base:    90569.11±1.15%
node:   104152.26±0.34%  +15.0%
delay:  127309.46±4.25%  +40.6%
skip:   125501.96±1.83%  +38.6%

nr_thread=75% on SPR:
base:   100803.96±0.57%
node:   107333.58±0.44%   +6.5%
delay:  124332.39±0.51%  +23.3%
skip:   127676.55±0.03%  +26.7%

nr_thread=75% on ICL:
base:   61961.26±0.41%
node:   61585.45±0.50%
delay:  72420.52±0.14%  +16.9%
skip:   72413.23±0.30%  +16.9%

==================================================
hackbench/pipe/threads/fd=20/loop=1000000 (throughput, higher is better)
group=1/4/8/16 were tested on 2-socket SPR and Cascade Lake and the
results that have a measurable difference are:

group=8 on SPR:
base:   437163±2.6%
node:   471203±1.2%   +7.8%
delay:  490780±0.9%  +12.3%
skip:   493062±1.9%  +12.8%

group=16 on SPR:
base:   468279±1.9%
node:   580385±1.7%  +23.9%
delay:  664422±0.2%  +41.9%
skip:   697387±0.2%  +48.9%

==================================================
netperf/TCP_STREAM (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2-socket SPR and Cascade
Lake and results that have a measurable difference are:

nr_thread=50% on SPR:
base:   16258±0.7%
node:   16172±2.9%
delay:  17729±0.7%   +9.0%
skip:   17823±1.3%   +9.6%

nr_thread=75% on CSL:
base:   12923±1.2%
node:   13011±2.2%
delay:  15452±1.6%  +19.6%
skip:   15302±1.7%  +18.4%

nr_thread=75% on SPR:
base:   16232±11.9%
node:   13962±5.1%
delay:  21089±0.8%  +29.9%
skip:   21251±0.4%  +30.9%

nr_thread=100% on SPR:
base:   13220±0.6%
node:   13113±0.0%
delay:  18258±11.3%  +38.1%
skip:   16974±12.7%  +28.4%

==================================================
netperf/UDP_RR (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2-socket SPR and Cascade
Lake and results that have a measurable difference are:

nr_thread=25% on CSL:
base:   107269±0.3%
node:   107226±0.2%
delay:  106978±0.3%
skip:   109652±0.3%   +2.2%

nr_thread=50% on CSL:
base:   74854±0.1%
node:   74521±0.4%
delay:  74438±0.2%
skip:   76431±0.1%   +2.1%

nr_thread=75% on CSL:
base:   36701±1.7%
node:   39949±1.4%   +8.8%
delay:  42516±0.3%  +15.8%
skip:   45044±0.5%  +22.7%

nr_thread=75% on SPR:
base:   14249±3.8%
node:   19890±2.0%   +39.6%
delay:  31331±0.5%  +119.9%
skip:   33688±3.5%  +136.4%

nr_thread=100% on CSL:
base:   52275±0.6%
node:   53827±0.4%   +3.0%
delay:  78386±0.7%  +49.9%
skip:   76926±2.3%  +47.2%

nr_thread=100% on SPR:
base:    9560±1.6%
node:   14186±3.9%   +48.4%
delay:  20779±2.8%  +117.4%
skip:   32125±2.5%  +236.0%

Signed-off-by: Aaron Lu
---
 kernel/sched/fair.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 564ffe3e59c1..0dbbb92302ad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4862,7 +4862,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
 	se_update_runnable(se);
-	update_cfs_group(se);
+	if (cfs_rq->nr_running > 0)
+		update_cfs_group(se);
 	account_entity_enqueue(cfs_rq, se);
 
 	if (flags & ENQUEUE_WAKEUP)
@@ -4978,8 +4979,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/* return excess runtime on last dequeue */
 	return_cfs_rq_runtime(cfs_rq);
 
-	update_cfs_group(se);
-
 	/*
 	 * Now advance min_vruntime if @se was the entity holding it back,
 	 * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
-- 
2.41.0