From: Aaron Lu
To: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli
Cc: Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
    Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen,
    Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan, Mathieu Desnoyers,
    Gautham R. Shenoy, David Vernet, linux-kernel@vger.kernel.org
Subject: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
Date: Wed, 23 Aug 2023 14:08:32 +0800
Message-ID: <20230823060832.454842-2-aaron.lu@intel.com>
X-Mailer: git-send-email 2.41.0
In-Reply-To: <20230823060832.454842-1-aaron.lu@intel.com>
References: <20230823060832.454842-1-aaron.lu@intel.com>

When using sysbench to benchmark Postgres in a single docker instance with
sysbench's nr_threads set to nr_cpu, it is observed that update_cfs_group()
and update_load_avg() at times show noticeable overhead on a
2-socket/112-core/224-CPU Intel Sapphire Rapids (SPR) machine:

    13.75%  13.74%  [kernel.vmlinux]  [k] update_cfs_group
    10.63%  10.04%  [kernel.vmlinux]  [k] update_load_avg

Annotation shows the cycles are mostly spent on accessing tg->load_avg, with
update_load_avg() being the write side and update_cfs_group() being the
read side.
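(Illustration only, not part of the patch: a minimal, self-contained
userspace sketch of that write-side/read-side split around one shared
accumulator. All names below are invented; the shared atomic stands in for
tg->load_avg, fold_contrib() for the write side reached from
update_load_avg(), and group_share() for the read side in
update_cfs_group().)

#include <stdatomic.h>
#include <stdio.h>

struct task_group_like {
	atomic_long load_avg;		/* shared by every runqueue in the group */
};

struct cfs_rq_like {
	long load_avg;			/* this runqueue's current load */
	long tg_load_avg_contrib;	/* what was last folded into the group */
	struct task_group_like *tg;
};

/* "write side": rough analogue of update_tg_load_avg() folding a delta in */
static void fold_contrib(struct cfs_rq_like *cfs_rq)
{
	long delta = cfs_rq->load_avg - cfs_rq->tg_load_avg_contrib;

	if (delta) {
		atomic_fetch_add(&cfs_rq->tg->load_avg, delta);
		cfs_rq->tg_load_avg_contrib = cfs_rq->load_avg;
	}
}

/* "read side": rough analogue of update_cfs_group() consuming the sum */
static long group_share(struct cfs_rq_like *cfs_rq, long shares)
{
	long tg_load = atomic_load(&cfs_rq->tg->load_avg);

	return tg_load ? shares * cfs_rq->load_avg / tg_load : shares;
}

int main(void)
{
	struct task_group_like tg = { 0 };
	struct cfs_rq_like rq0 = { .load_avg = 300, .tg = &tg };
	struct cfs_rq_like rq1 = { .load_avg = 100, .tg = &tg };

	fold_contrib(&rq0);
	fold_contrib(&rq1);
	printf("cpu0 share: %ld\n", group_share(&rq0, 1024));	/* prints 768 */
	return 0;
}

Every fold_contrib() call is a read-modify-write on memory shared by all
CPUs in the group, which is what makes the following point matter.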
tg->load_avg is per task group, and when different tasks of the same task
group running on different CPUs frequently access it, it can be heavily
contended. E.g. when running postgres_sysbench on a 2-socket/112-core/224-CPU
Intel Sapphire Rapids machine, during a 5s window the wakeup count is
14 million and the migration count is 11 million; with each migration, the
task's load transfers from the source cfs_rq to the target cfs_rq, and each
such change involves an update to tg->load_avg. Since the workload triggers
that many wakeups and migrations, the accesses (both read and write) to
tg->load_avg are essentially unbounded. As a result, the two functions
mentioned above show noticeable overhead. With netperf/nr_client=nr_cpu/UDP_RR,
the problem is worse: during a 5s window, the wakeup count is 21 million and
the migration count is 14 million; update_cfs_group() costs ~25% and
update_load_avg() costs ~16%.
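(Again purely illustrative, not kernel code: a self-contained sketch of that
kind of unbounded access pattern, where many threads issue atomic
read-modify-writes against one shared counter, much like many CPUs updating a
single tg->load_avg on every migration. Build with `cc -pthread`; all names
are made up.)

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS	16		/* stands in for CPUs in one task group */
#define NUPDATES	(1 << 20)	/* stands in for wakeups/migrations */

static atomic_long shared_load;		/* one hot cacheline, like tg->load_avg */

static void *hammer(void *arg)
{
	(void)arg;
	for (long i = 0; i < NUPDATES; i++) {
		atomic_fetch_add(&shared_load, 1);	/* load arrives on this "cpu" */
		atomic_fetch_sub(&shared_load, 1);	/* load migrates away again */
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, hammer, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);

	printf("final shared load: %ld\n", atomic_load(&shared_load));
	return 0;
}

On a large multi-socket machine the resulting cacheline ping-pong is what the
profiles above show; the change described next bounds how often each cfs_rq
is allowed to touch the shared value.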
Reduce the overhead by limiting updates to tg->load_avg to at most once
per ms. After this change, the cost of accessing tg->load_avg is greatly
reduced and performance improved. Detailed test results below.

===============================
postgres_sysbench on SPR:
25%   base: 42382±19.8%   patch: 50174±9.5%    (noise)
50%   base: 67626±1.3%    patch: 67365±3.1%    (noise)
75%   base: 100216±1.2%   patch: 112470±0.1%   +12.2%
100%  base: 93671±0.4%    patch: 113563±0.2%   +21.2%

===============================
hackbench on ICL:
group=1   base: 114912±5.2%   patch: 117857±2.5%   (noise)
group=4   base: 359902±1.6%   patch: 361685±2.7%   (noise)
group=8   base: 461070±0.8%   patch: 491713±0.3%   +6.6%
group=16  base: 309032±5.0%   patch: 378337±1.3%   +22.4%

===============================
hackbench on SPR:
group=1   base: 100768±2.9%   patch: 103134±2.9%   (noise)
group=4   base: 413830±12.5%  patch: 378660±16.6%  (noise)
group=8   base: 436124±0.6%   patch: 490787±3.2%   +12.5%
group=16  base: 457730±3.2%   patch: 680452±1.3%   +48.8%

===============================
netperf/udp_rr on ICL:
25%   base: 114413±0.1%   patch: 115111±0.0%   +0.6%
50%   base: 86803±0.5%    patch: 86611±0.0%    (noise)
75%   base: 35959±5.3%    patch: 49801±0.6%    +38.5%
100%  base: 61951±6.4%    patch: 70224±0.8%    +13.4%

===============================
netperf/udp_rr on SPR:
25%   base: 104954±1.3%   patch: 107312±2.8%   (noise)
50%   base: 55394±4.6%    patch: 54940±7.4%    (noise)
75%   base: 13779±3.1%    patch: 36105±1.1%    +162%
100%  base: 9703±3.7%     patch: 28011±0.2%    +189%

===============================
netperf/tcp_stream on ICL (all in noise range):
25%   base: 43092±0.1%    patch: 42891±0.5%
50%   base: 19278±14.9%   patch: 22369±7.2%
75%   base: 16822±3.0%    patch: 17086±2.3%
100%  base: 18216±0.6%    patch: 18078±2.9%

===============================
netperf/tcp_stream on SPR (all in noise range):
25%   base: 34491±0.3%    patch: 34886±0.5%
50%   base: 19278±14.9%   patch: 22369±7.2%
75%   base: 16822±3.0%    patch: 17086±2.3%
100%  base: 18216±0.6%    patch: 18078±2.9%

Reported-by: Nitin Tekchandani
Suggested-by: Vincent Guittot
Signed-off-by: Aaron Lu
Reviewed-by: Vincent Guittot
Reviewed-by: David Vernet
Reviewed-by: Mathieu Desnoyers
Tested-by: Mathieu Desnoyers
Tested-by: Swapnil Sapkal
---
 kernel/sched/fair.c  | 13 ++++++++++++-
 kernel/sched/sched.h |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c28206499a3d..a5462d1fcc48 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
-	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+	long delta;
+	u64 now;
 
 	/*
 	 * No need to update load_avg for root_task_group as it is not used.
@@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (cfs_rq->tg == &root_task_group)
 		return;
 
+	/*
+	 * For migration heavy workload, access to tg->load_avg can be
+	 * unbound. Limit the update rate to at most once per ms.
+	 */
+	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
+		return;
+
+	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
 		atomic_long_add(delta, &cfs_rq->tg->load_avg);
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+		cfs_rq->last_update_tg_load_avg = now;
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6a8b7b9ed089..52ee7027def9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -593,6 +593,7 @@ struct cfs_rq {
 	} removed;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	u64			last_update_tg_load_avg;
 	unsigned long		tg_load_avg_contrib;
 	long			propagate;
 	long			prop_runnable_sum;
-- 
2.41.0