From: Aaron Lu
To: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli
Cc: Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
    Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen,
    Nitin Tekchandani, Yu Chen, Waiman Long, Deng Pan, Mathieu Desnoyers,
    Gautham R. Shenoy, David Vernet, linux-kernel@vger.kernel.org
Subject: [PATCH 1/1] sched/fair: ratelimit update to tg->load_avg
Date: Wed, 23 Aug 2023 14:08:32 +0800
Message-ID: <20230823060832.454842-2-aaron.lu@intel.com>
X-Mailer: git-send-email 2.41.0
In-Reply-To: <20230823060832.454842-1-aaron.lu@intel.com>
References: <20230823060832.454842-1-aaron.lu@intel.com>

When using sysbench to benchmark Postgres in a single docker instance with
sysbench's nr_threads set to nr_cpu, it is observed that update_cfs_group()
and update_load_avg() at times show noticeable overhead on a
2-socket/112-core/224-CPU Intel Sapphire Rapids (SPR) machine:

    13.75%  13.74%  [kernel.vmlinux]  [k] update_cfs_group
    10.63%  10.04%  [kernel.vmlinux]  [k] update_load_avg

Annotation shows the cycles are mostly spent on accessing tg->load_avg, with
update_load_avg() being the write side and update_cfs_group() being the
read side.
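(Illustration only, not part of the patch: a minimal, self-contained
userspace sketch of that write-side/read-side split around one shared
accumulator. All names below are invented; the shared atomic stands in for
tg->load_avg, fold_contrib() for the write side reached from
update_load_avg(), and group_share() for the read side in
update_cfs_group().)

#include <stdatomic.h>
#include <stdio.h>

struct task_group_like {
	atomic_long load_avg;		/* shared by every runqueue in the group */
};

struct cfs_rq_like {
	long load_avg;			/* this runqueue's current load */
	long tg_load_avg_contrib;	/* what was last folded into the group */
	struct task_group_like *tg;
};

/* "write side": rough analogue of update_tg_load_avg() folding a delta in */
static void fold_contrib(struct cfs_rq_like *cfs_rq)
{
	long delta = cfs_rq->load_avg - cfs_rq->tg_load_avg_contrib;

	if (delta) {
		atomic_fetch_add(&cfs_rq->tg->load_avg, delta);
		cfs_rq->tg_load_avg_contrib = cfs_rq->load_avg;
	}
}

/* "read side": rough analogue of update_cfs_group() consuming the sum */
static long group_share(struct cfs_rq_like *cfs_rq, long shares)
{
	long tg_load = atomic_load(&cfs_rq->tg->load_avg);

	return tg_load ? shares * cfs_rq->load_avg / tg_load : shares;
}

int main(void)
{
	struct task_group_like tg = { 0 };
	struct cfs_rq_like rq0 = { .load_avg = 300, .tg = &tg };
	struct cfs_rq_like rq1 = { .load_avg = 100, .tg = &tg };

	fold_contrib(&rq0);
	fold_contrib(&rq1);
	printf("cpu0 share: %ld\n", group_share(&rq0, 1024));	/* prints 768 */
	return 0;
}

Every fold_contrib() call is a read-modify-write on memory shared by all
CPUs in the group, which is what makes the following point matter.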
tg->load_avg is per task group, and when different tasks of the same task
group running on different CPUs frequently access it, it can be heavily
contended. E.g. when running postgres_sysbench on a 2-socket/112-core/224-CPU
Intel Sapphire Rapids machine, during a 5s window the wakeup count is
14 million and the migration count is 11 million; with each migration, the
task's load transfers from the source cfs_rq to the target cfs_rq, and each
such change involves an update to tg->load_avg. Since the workload triggers
that many wakeups and migrations, the accesses (both read and write) to
tg->load_avg are essentially unbounded. As a result, the two functions
mentioned above show noticeable overhead. With netperf/nr_client=nr_cpu/UDP_RR,
the problem is worse: during a 5s window, the wakeup count is 21 million and
the migration count is 14 million; update_cfs_group() costs ~25% and
update_load_avg() costs ~16%.
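(Again purely illustrative, not kernel code: a self-contained sketch of that
kind of unbounded access pattern, where many threads issue atomic
read-modify-writes against one shared counter, much like many CPUs updating a
single tg->load_avg on every migration. Build with `cc -pthread`; all names
are made up.)

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS	16		/* stands in for CPUs in one task group */
#define NUPDATES	(1 << 20)	/* stands in for wakeups/migrations */

static atomic_long shared_load;		/* one hot cacheline, like tg->load_avg */

static void *hammer(void *arg)
{
	(void)arg;
	for (long i = 0; i < NUPDATES; i++) {
		atomic_fetch_add(&shared_load, 1);	/* load arrives on this "cpu" */
		atomic_fetch_sub(&shared_load, 1);	/* load migrates away again */
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, hammer, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);

	printf("final shared load: %ld\n", atomic_load(&shared_load));
	return 0;
}

On a large multi-socket machine the resulting cacheline ping-pong is what the
profiles above show; the change described next bounds how often each cfs_rq
is allowed to touch the shared value.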
Reduce the overhead by limiting updates to tg->load_avg to at most once
per ms. After this change, the cost of accessing tg->load_avg is greatly
reduced and performance improved. Detailed test results below.

===============================
postgres_sysbench on SPR:
25%   base: 42382±19.8%   patch: 50174±9.5%    (noise)
50%   base: 67626±1.3%    patch: 67365±3.1%    (noise)
75%   base: 100216±1.2%   patch: 112470±0.1%   +12.2%
100%  base: 93671±0.4%    patch: 113563±0.2%   +21.2%

===============================
hackbench on ICL:
group=1   base: 114912±5.2%   patch: 117857±2.5%   (noise)
group=4   base: 359902±1.6%   patch: 361685±2.7%   (noise)
group=8   base: 461070±0.8%   patch: 491713±0.3%   +6.6%
group=16  base: 309032±5.0%   patch: 378337±1.3%   +22.4%

===============================
hackbench on SPR:
group=1   base: 100768±2.9%   patch: 103134±2.9%   (noise)
group=4   base: 413830±12.5%  patch: 378660±16.6%  (noise)
group=8   base: 436124±0.6%   patch: 490787±3.2%   +12.5%
group=16  base: 457730±3.2%   patch: 680452±1.3%   +48.8%

===============================
netperf/udp_rr on ICL:
25%   base: 114413±0.1%   patch: 115111±0.0%   +0.6%
50%   base: 86803±0.5%    patch: 86611±0.0%    (noise)
75%   base: 35959±5.3%    patch: 49801±0.6%    +38.5%
100%  base: 61951±6.4%    patch: 70224±0.8%    +13.4%

===============================
netperf/udp_rr on SPR:
25%   base: 104954±1.3%   patch: 107312±2.8%   (noise)
50%   base: 55394±4.6%    patch: 54940±7.4%    (noise)
75%   base: 13779±3.1%    patch: 36105±1.1%    +162%
100%  base: 9703±3.7%     patch: 28011±0.2%    +189%

===============================
netperf/tcp_stream on ICL (all in noise range):
25%   base: 43092±0.1%    patch: 42891±0.5%
50%   base: 19278±14.9%   patch: 22369±7.2%
75%   base: 16822±3.0%    patch: 17086±2.3%
100%  base: 18216±0.6%    patch: 18078±2.9%

===============================
netperf/tcp_stream on SPR (all in noise range):
25%   base: 34491±0.3%    patch: 34886±0.5%
50%   base: 19278±14.9%   patch: 22369±7.2%
75%   base: 16822±3.0%    patch: 17086±2.3%
100%  base: 18216±0.6%    patch: 18078±2.9%

Reported-by: Nitin Tekchandani
Suggested-by: Vincent Guittot
Signed-off-by: Aaron Lu
Reviewed-by: Vincent Guittot
Reviewed-by: David Vernet
Reviewed-by: Mathieu Desnoyers
Tested-by: Mathieu Desnoyers
Tested-by: Swapnil Sapkal
---
 kernel/sched/fair.c  | 13 ++++++++++++-
 kernel/sched/sched.h |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c28206499a3d..a5462d1fcc48 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3664,7 +3664,8 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
-	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+	long delta;
+	u64 now;
 
 	/*
 	 * No need to update load_avg for root_task_group as it is not used.
@@ -3672,9 +3673,19 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (cfs_rq->tg == &root_task_group)
 		return;
 
+	/*
+	 * For migration heavy workload, access to tg->load_avg can be
+	 * unbound. Limit the update rate to at most once per ms.
+	 */
+	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
+		return;
+
+	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
 	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
 		atomic_long_add(delta, &cfs_rq->tg->load_avg);
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+		cfs_rq->last_update_tg_load_avg = now;
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6a8b7b9ed089..52ee7027def9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -593,6 +593,7 @@ struct cfs_rq {
 	} removed;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+	u64			last_update_tg_load_avg;
 	unsigned long		tg_load_avg_contrib;
 	long			propagate;
 	long			prop_runnable_sum;
-- 
2.41.0