From: Aaron Lu
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Daniel Jordan
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, linux-kernel@vger.kernel.org
Subject: [PATCH 1/4] sched/fair: free allocated memory on error in alloc_fair_sched_group()
Date: Tue, 18 Jul 2023 21:41:17 +0800
Message-ID: <20230718134120.81199-2-aaron.lu@intel.com>
In-Reply-To: <20230718134120.81199-1-aaron.lu@intel.com>

There is one struct cfs_rq and one struct sched_entity per CPU for each
task group. When the allocation of tg->cfs_rq[X] fails, the already
allocated tg->cfs_rq[0]..tg->cfs_rq[X-1] should be freed. The same
applies to tg->se.
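The cleanup pattern the patch adopts can be sketched in plain userspace
C. Everything below is a hypothetical stand-in, not kernel code:
demo_group and per_cpu_a/per_cpu_b model the tg->cfs_rq[]/tg->se[]
arrays, and calloc()/free() model kcalloc()/kzalloc_node()/kfree():

#include <stdlib.h>

/* Hypothetical stand-ins for tg->cfs_rq[] and tg->se[]. */
struct demo_group {
	int **per_cpu_a;
	int **per_cpu_b;
};

/* Returns 1 on success, 0 on failure; nothing leaks on failure. */
static int demo_alloc(struct demo_group *g, int nr_cpus)
{
	int i;

	g->per_cpu_a = calloc(nr_cpus, sizeof(*g->per_cpu_a));
	if (!g->per_cpu_a)
		return 0;
	g->per_cpu_b = calloc(nr_cpus, sizeof(*g->per_cpu_b));
	if (!g->per_cpu_b)
		goto err_free_a;

	for (i = 0; i < nr_cpus; i++) {
		g->per_cpu_a[i] = calloc(1, sizeof(int));
		if (!g->per_cpu_a[i])
			goto err_free_elems;
		g->per_cpu_b[i] = calloc(1, sizeof(int));
		if (!g->per_cpu_b[i])
			goto err_free_elems;
	}
	return 1;

err_free_elems:
	/*
	 * Walk from index 0 and free whatever was allocated before the
	 * failure; the first slot where both pointers are still NULL
	 * marks the point the allocation loop never reached.
	 */
	for (i = 0; i < nr_cpus; i++) {
		free(g->per_cpu_a[i]);
		free(g->per_cpu_b[i]);
		if (!g->per_cpu_a[i] && !g->per_cpu_b[i])
			break;
	}
	free(g->per_cpu_b);
err_free_a:
	free(g->per_cpu_a);
	return 0;
}

Because calloc() zeroes the pointer arrays, every slot past the failure
point is NULL, and free(NULL) is a no-op, so the single loop handles
both the half-allocated slot and the untouched slots. That is the same
reasoning the patch relies on with kcalloc() and kfree().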
Signed-off-by: Aaron Lu
---
 kernel/sched/fair.c | 23 ++++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..0f913487928d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12443,10 +12443,10 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 
 	tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
 	if (!tg->cfs_rq)
-		goto err;
+		return 0;
 	tg->se = kcalloc(nr_cpu_ids, sizeof(se), GFP_KERNEL);
 	if (!tg->se)
-		goto err;
+		goto err_free_rq_pointer;
 
 	tg->shares = NICE_0_LOAD;
 
@@ -12456,12 +12456,12 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 		cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
				      GFP_KERNEL, cpu_to_node(i));
 		if (!cfs_rq)
-			goto err;
+			goto err_free;
 
 		se = kzalloc_node(sizeof(struct sched_entity_stats),
				  GFP_KERNEL, cpu_to_node(i));
 		if (!se)
-			goto err_free_rq;
+			goto err_free;
 
 		init_cfs_rq(cfs_rq);
 		init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
@@ -12470,9 +12470,18 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 
 	return 1;
 
-err_free_rq:
-	kfree(cfs_rq);
-err:
+err_free:
+	for_each_possible_cpu(i) {
+		kfree(tg->cfs_rq[i]);
+		kfree(tg->se[i]);
+
+		if (!tg->cfs_rq[i] && !tg->se[i])
+			break;
+	}
+	kfree(tg->se);
+err_free_rq_pointer:
+	kfree(tg->cfs_rq);
+
 	return 0;
 }
 
-- 
2.41.0

From: Aaron Lu
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Daniel Jordan
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 2/4] sched/fair: Make tg->load_avg per node
Date: Tue, 18 Jul 2023 21:41:18 +0800
Message-ID: <20230718134120.81199-3-aaron.lu@intel.com>
In-Reply-To: <20230718134120.81199-1-aaron.lu@intel.com>

When using sysbench to benchmark Postgres in a single docker instance
with sysbench's nr_threads set to nr_cpu, update_cfs_group() and
update_load_avg() are at times observed to show noticeable overhead on
a 2-socket/112-core/224-CPU Intel Sapphire Rapids (SPR):

    13.75%    13.74%  [kernel.vmlinux]    [k] update_cfs_group
    10.63%    10.04%  [kernel.vmlinux]    [k] update_load_avg

Annotation shows the cycles are mostly spent on accessing tg->load_avg,
with update_load_avg() being the write side and update_cfs_group()
being the read side. Tim Chen told me that PeterZ once mentioned a way
to solve a similar problem by making a counter per node, so do the same
for tg->load_avg.
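The core idea is a sharded counter: writers only touch their local
node's shard and readers sum all shards. A minimal userspace sketch
using C11 atomics follows; NR_NODES and all names here are purely
illustrative (the kernel code instead keeps one atomic_long_t per node
in tg->node_info[] and sums them in tg_load_avg()):

#include <stdatomic.h>

#define NR_NODES 2	/* illustrative stand-in for num_possible_nodes() */

/*
 * One counter shard per node, each on its own cache line so that
 * writers on different sockets do not bounce the same line around
 * (this mirrors what ____cacheline_aligned_in_smp achieves).
 */
struct node_counter {
	_Alignas(64) atomic_long load_avg;
};

static struct node_counter shard[NR_NODES];

/* Write side: touch only the local node's shard. */
static void counter_add(int node, long delta)
{
	atomic_fetch_add_explicit(&shard[node].load_avg, delta,
				  memory_order_relaxed);
}

/*
 * Read side: sum all shards. The result is a slightly racy snapshot,
 * which load tracking tolerates.
 */
static long counter_read(void)
{
	long sum = 0;

	for (int i = 0; i < NR_NODES; i++)
		sum += atomic_load_explicit(&shard[i].load_avg,
					    memory_order_relaxed);
	return sum;
}

The trade-off is that a read now touches one cache line per node
instead of one global line, but each of those lines is only ever
dirtied by writers on its own node, so cross-socket cacheline bouncing
drops sharply.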
After this change, the cost of the two functions is reduced and
sysbench transactions are increased on SPR. Below are test results.

==================================================
postgres_sysbench (transactions, higher is better)
nr_thread=100%/75%/50% were tested on 2-socket SPR and Ice Lake and
results that have a measurable difference are:

nr_thread=100% on SPR:
base:   90569.11±1.15%
node:  104152.26±0.34%  +15.0%

nr_thread=75% on SPR:
base:  100803.96±0.57%
node:  107333.58±0.44%   +6.5%

==================================================
hackbench/pipe/threads/fd=20/loop=1000000 (throughput, higher is better)
group=1/4/8/16 were tested on 2-socket SPR and Cascade Lake and the
results that have a measurable difference are:

group=8 on SPR:
base:  437163±2.6%
node:  471203±1.2%   +7.8%

group=16 on SPR:
base:  468279±1.9%
node:  580385±1.7%  +23.9%

==================================================
netperf/TCP_STREAM
nr_thread=1/25%/50%/75%/100% were tested on 2-socket SPR and Cascade
Lake and there is no measurable difference.

==================================================
netperf/UDP_RR (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2-socket SPR and Cascade
Lake and results that have a measurable difference are:

nr_thread=75% on Cascade Lake:
base:  36701±1.7%
node:  39949±1.4%   +8.8%

nr_thread=75% on SPR:
base:  14249±3.8%
node:  19890±2.0%  +39.6%

nr_thread=100% on Cascade Lake:
base:  52275±0.6%
node:  53827±0.4%   +3.0%

nr_thread=100% on SPR:
base:   9560±1.6%
node:  14186±3.9%  +48.4%

Reported-by: Nitin Tekchandani
Signed-off-by: Aaron Lu
---
 kernel/sched/debug.c |  2 +-
 kernel/sched/fair.c  | 29 ++++++++++++++++++++++++++---
 kernel/sched/sched.h | 43 +++++++++++++++++++++++++++++++++----------
 3 files changed, 60 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 066ff1c8ae4e..3af965a18866 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -691,7 +691,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
			cfs_rq->tg_load_avg_contrib);
 	SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_avg",
-			atomic_long_read(&cfs_rq->tg->load_avg));
+			tg_load_avg(cfs_rq->tg));
 #endif
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f913487928d..aceb8f5922cb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3496,7 +3496,7 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
 
 	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
 
-	tg_weight = atomic_long_read(&tg->load_avg);
+	tg_weight = tg_load_avg(tg);
 
 	/* Ensure tg_weight >= load */
 	tg_weight -= cfs_rq->tg_load_avg_contrib;
@@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
 	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+	int node = cpu_to_node(smp_processor_id());
 
 	/*
 	 * No need to update load_avg for root_task_group as it is not used.
@@ -3673,7 +3674,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 		return;
 
 	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
-		atomic_long_add(delta, &cfs_rq->tg->load_avg);
+		atomic_long_add(delta, &cfs_rq->tg->node_info[node]->load_avg);
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
 	}
 }
@@ -12439,7 +12440,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 {
 	struct sched_entity *se;
 	struct cfs_rq *cfs_rq;
-	int i;
+	int i, nodes;
 
 	tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
 	if (!tg->cfs_rq)
@@ -12468,8 +12469,30 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 		init_entity_runnable_average(se);
 	}
 
+#ifdef CONFIG_SMP
+	nodes = num_possible_nodes();
+	tg->node_info = kcalloc(nodes, sizeof(struct tg_node_info *), GFP_KERNEL);
+	if (!tg->node_info)
+		goto err_free;
+
+	for_each_node(i) {
+		tg->node_info[i] = kzalloc_node(sizeof(struct tg_node_info), GFP_KERNEL, i);
+		if (!tg->node_info[i])
+			goto err_free_node;
+	}
+#endif
+
 	return 1;
 
+#ifdef CONFIG_SMP
+err_free_node:
+	for_each_node(i) {
+		kfree(tg->node_info[i]);
+		if (!tg->node_info[i])
+			break;
+	}
+	kfree(tg->node_info);
+#endif
 err_free:
 	for_each_possible_cpu(i) {
 		kfree(tg->cfs_rq[i]);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14dfaafb3a8f..9cece2dbc95b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -359,6 +359,17 @@ struct cfs_bandwidth {
 #endif
 };
 
+struct tg_node_info {
+	/*
+	 * load_avg can be heavily contended at clock tick time and task
+	 * enqueue/dequeue time, so put it in its own cacheline separated
+	 * from other fields.
+	 */
+	struct {
+		atomic_long_t load_avg;
+	} ____cacheline_aligned_in_smp;
+};
+
 /* Task group related information */
 struct task_group {
 	struct cgroup_subsys_state css;
@@ -373,15 +384,8 @@ struct task_group {
 	/* A positive value indicates that this is a SCHED_IDLE group. */
 	int			idle;
 
-#ifdef CONFIG_SMP
-	/*
-	 * load_avg can be heavily contended at clock tick time, so put
-	 * it in its own cacheline separated from the fields above which
-	 * will also be accessed at each tick.
-	 */
-	struct {
-		atomic_long_t load_avg;
-	} ____cacheline_aligned_in_smp;
+#ifdef CONFIG_SMP
+	struct tg_node_info	**node_info;
 #endif
 #endif
 
@@ -413,9 +417,28 @@ struct task_group {
 	/* Effective clamp values used for a task group */
 	struct uclamp_se	uclamp[UCLAMP_CNT];
 #endif
-
 };
 
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+static inline long tg_load_avg(struct task_group *tg)
+{
+	long load_avg = 0;
+	int i;
+
+	/*
+	 * The only path that can give us a root_task_group
+	 * here is from print_cfs_rq() thus unlikely.
+	 */
+	if (unlikely(tg == &root_task_group))
+		return 0;
+
+	for_each_node(i)
+		load_avg += atomic_long_read(&tg->node_info[i]->load_avg);
+
+	return load_avg;
+}
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #define ROOT_TASK_GROUP_LOAD	NICE_0_LOAD
 
-- 
2.41.0

From: Aaron Lu
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Daniel Jordan
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load
Date: Tue, 18 Jul 2023 21:41:19 +0800
Message-ID: <20230718134120.81199-4-aaron.lu@intel.com>
In-Reply-To: <20230718134120.81199-1-aaron.lu@intel.com>

When a workload involves many wake-time task migrations, tg->load_avg
can be heavily contended among CPUs because every migration involves
removing the task's load from its src cfs_rq and attaching that load
to its new cfs_rq. Both the remove and the attach require an update
to tg->load_avg as well as propagating the change up the hierarchy.
For example, when running postgres_sysbench on a
2-socket/112-core/224-CPU Intel Sapphire Rapids, during a 5s window the
wakeup count is 14 million and the migration count is 11 million. Since
the workload can trigger that many wakeups and migrations, the accesses
(both read and write) to tg->load_avg are essentially unbounded. For
this workload, the profile shows update_cfs_group() costs ~13% and
update_load_avg() costs ~10%. With netperf/nr_client=nr_cpu/UDP_RR, the
wakeup count is 21 million and the migration count is 14 million;
update_cfs_group() costs ~25% and update_load_avg() costs ~16%.

This patch is an attempt to reduce the cost of accessing tg->load_avg.
The current logic immediately calls update_tg_load_avg() when a cfs_rq
has removed load; this patch changes that behavior: when
update_cfs_rq_load_avg() discovers that this cfs_rq has removed load,
it does not call update_tg_load_avg() or propagate the removed load
immediately; instead, the update to tg->load_avg and the propagation
are dealt with by a following event, such as a task being attached to
this cfs_rq, or in update_blocked_averages(). This way, the calls to
update_tg_load_avg() for this cfs_rq and its ancestors are reduced by
about half.
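Reduced to its essence, this is delta batching: park the removed-load
delta in a per-queue field and fold it into the next write that must
touch the shared counter anyway. A minimal userspace sketch with
illustrative names (prop_removed_sum mirrors the new cfs_rq field;
shared_load stands in for tg->load_avg):

#include <stdatomic.h>

static atomic_long shared_load;		/* stands in for tg->load_avg */

struct queue {
	long prop_removed_sum;		/* parked, not yet propagated */
};

/* Remove path: only accumulate; no shared-counter write here. */
static void remove_load(struct queue *q, long removed)
{
	q->prop_removed_sum -= removed;
}

/*
 * Any later event that has to update the shared counter anyway
 * flushes the parked sum in the same write, so one shared-cacheline
 * access does the work that previously took two.
 */
static void update_shared(struct queue *q, long delta)
{
	delta += q->prop_removed_sum;
	q->prop_removed_sum = 0;
	atomic_fetch_add_explicit(&shared_load, delta,
				  memory_order_relaxed);
}

The cost of the deferral is staleness between the remove and the flush,
which is why the patch also treats a pending prop_removed_sum as "not
decayed" so the cfs_rq stays on the leaf list and
update_blocked_averages() eventually flushes it.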
==================================================
postgres_sysbench (transactions, higher is better)
nr_thread=100%/75%/50% were tested on 2-socket SPR and Ice Lake and
results that have a measurable difference are:

nr_thread=100% on SPR:
base:    90569.11±1.15%
node:   104152.26±0.34%  +15.0%
delay:  127309.46±4.25%  +40.6%

nr_thread=75% on SPR:
base:   100803.96±0.57%
node:   107333.58±0.44%   +6.5%
delay:  124332.39±0.51%  +23.3%

nr_thread=75% on ICL:
base:   61961.26±0.41%
node:   61585.45±0.50%
delay:  72420.52±0.14%  +16.9%

==================================================
hackbench/pipe/threads/fd=20/loop=1000000 (throughput, higher is better)
group=1/4/8/16 were tested on 2-socket SPR and Cascade Lake and the
results that have a measurable difference are:

group=8 on SPR:
base:   437163±2.6%
node:   471203±1.2%   +7.8%
delay:  490780±0.9%  +12.3%

group=16 on SPR:
base:   468279±1.9%
node:   580385±1.7%  +23.9%
delay:  664422±0.2%  +41.9%

==================================================
netperf/TCP_STREAM (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2-socket SPR and Cascade
Lake and results that have a measurable difference are:

nr_thread=50% on CSL:
base:   16258±0.7%
node:   16172±2.9%
delay:  17729±0.7%   +9.0%

nr_thread=75% on CSL:
base:   12923±1.2%
node:   13011±2.2%
delay:  15452±1.6%  +19.6%

nr_thread=75% on SPR:
base:   16232±11.9%
node:   13962±5.1%
delay:  21089±0.8%  +29.9%

nr_thread=100% on SPR:
base:   13220±0.6%
node:   13113±0.0%
delay:  18258±11.3%  +38.1%

==================================================
netperf/UDP_RR (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2-socket SPR and Cascade
Lake and results that have a measurable difference are:

nr_thread=1 on CSL:
base:   128521±0.5%
node:   127935±0.6%
delay:  126317±0.4%   -1.7%

nr_thread=75% on CSL:
base:   36701±1.7%
node:   39949±1.4%   +8.8%
delay:  42516±0.3%  +15.8%

nr_thread=75% on SPR:
base:   14249±3.8%
node:   19890±2.0%   +39.6%
delay:  31331±0.5%  +119.9%

nr_thread=100% on CSL:
base:   52275±0.6%
node:   53827±0.4%   +3.0%
delay:  78386±0.7%  +49.9%

nr_thread=100% on SPR:
base:    9560±1.6%
node:   14186±3.9%   +48.4%
delay:  20779±2.8%  +117.4%

Signed-off-by: Aaron Lu
---
 kernel/sched/fair.c  | 23 ++++++++++++++++++-----
 kernel/sched/sched.h |  1 +
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aceb8f5922cb..564ffe3e59c1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3645,6 +3645,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	if (child_cfs_rq_on_list(cfs_rq))
 		return false;
 
+	if (cfs_rq->prop_removed_sum)
+		return false;
+
 	return true;
 }
 
@@ -3911,6 +3914,11 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
 {
 	cfs_rq->propagate = 1;
 	cfs_rq->prop_runnable_sum += runnable_sum;
+
+	if (cfs_rq->prop_removed_sum) {
+		cfs_rq->prop_runnable_sum += cfs_rq->prop_removed_sum;
+		cfs_rq->prop_removed_sum = 0;
+	}
 }
 
 /* Update task and its cfs_rq load average */
@@ -4133,13 +4141,11 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		 * removed_runnable is the unweighted version of removed_load so we
 		 * can use it to estimate removed_load_sum.
 		 */
-		add_tg_cfs_propagate(cfs_rq,
-			-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
-
-		decayed = 1;
+		cfs_rq->prop_removed_sum +=
+			-(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT;
 	}
 
-	decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
+	decayed = __update_load_avg_cfs_rq(now, cfs_rq);
 	u64_u32_store_copy(sa->last_update_time,
			   cfs_rq->last_update_time_copy,
			   sa->last_update_time);
@@ -9001,6 +9007,13 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
 
 		if (cfs_rq == &rq->cfs)
 			decayed = true;
+
+		/*
+		 * If the aggregated removed_sum hasn't been taken care of,
+		 * deal with it now before this cfs_rq is removed from the list.
+		 */
+		if (cfs_rq->prop_removed_sum)
+			add_tg_cfs_propagate(cfs_rq, 0);
 	}
 
 	/* Propagate pending load changes to the parent, if any: */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9cece2dbc95b..ab540b21d071 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -619,6 +619,7 @@ struct cfs_rq {
 	unsigned long		tg_load_avg_contrib;
 	long			propagate;
 	long			prop_runnable_sum;
+	long			prop_removed_sum;
 
 	/*
	 * h_load = weight * f(tg)
-- 
2.41.0

From: Aaron Lu
To: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot, Daniel Jordan
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Tim Chen, Nitin Tekchandani, Yu Chen, Waiman Long, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 4/4] sched/fair: skip some update_cfs_group() on en/dequeue_entity()
Date: Tue, 18 Jul 2023 21:41:20 +0800
Message-ID: <20230718134120.81199-5-aaron.lu@intel.com>
In-Reply-To: <20230718134120.81199-1-aaron.lu@intel.com>

After the previous patch, the cost of update_cfs_group() and
update_load_avg() has dropped to around 1% for postgres_sysbench on
SPR, but netperf/UDP_RR on SPR still sees ~20% update_cfs_group() and
~10% update_load_avg(), so this patch is another attempt to further
reduce the two functions' cost from the read side.

The observation is: if an entity is dequeued, updating its weight is
not useful, except that the current code also updates its cfs_rq's
load_avg using the updated weight. Removing update_cfs_group() from the
dequeue path therefore reduces the cost of accessing tg->load_avg, at
the price of some load-tracking accuracy.

Another hint comes from the ancient commit 17bc14b767cf ("Revert
"sched: Update_cfs_shares at period edge""): if an entity is enqueued
and it is the only entity on its cfs_rq, its weight does not need to be
updated immediately, since the weight is only needed to decide whether
the entity can preempt curr.

Commit 17bc14b767cf mentioned a latency problem when the calling
frequency of update_cfs_group() was reduced: doing a "make -j32" in one
terminal window made the browsing experience worse. To see how things
are now, I ran a test: two cgroups were created under root; in one
group I ran "make -j32" and in the meantime ran
"./schbench -m 1 -t 6 -r 300" in the other group, on a 6-core/12-CPU
Intel i7-8700T Coffee Lake CPU. The wakeup latency reported by schbench
for base and for this series doesn't look much different:

base:
schbench -m 1 -t 6 -r 300:
Latency percentiles (usec) runtime 300 (s) (18534 total samples)
        50.0th: 20 (9491 samples)
        75.0th: 25 (4768 samples)
        90.0th: 29 (2552 samples)
        95.0th: 62 (809 samples)
        *99.0th: 20320 (730 samples)
        99.5th: 23392 (92 samples)
        99.9th: 31392 (74 samples)
        min=6, max=32032

make -j32:
real    5m35.950s
user    47m33.814s
sys     4m45.470s

this series:
schbench -m 1 -t 6 -r 300:
Latency percentiles (usec) runtime 300 (s) (18528 total samples)
        50.0th: 21 (9920 samples)
        75.0th: 26 (4756 samples)
        90.0th: 30 (2100 samples)
        95.0th: 63 (846 samples)
        *99.0th: 19040 (722 samples)
        99.5th: 21920 (92 samples)
        99.9th: 30048 (81 samples)
        min=6, max=34873

make -j32:
real    5m35.185s
user    47m28.528s
sys     4m44.705s

As for netperf/UDP_RR/nr_thread=100% on SPR: after this change, the two
functions' cost dropped to ~2%.
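Reduced to a sketch with hypothetical names (queue and entity stand in
for cfs_rq and sched_entity, update_group_weight() for
update_cfs_group()), the resulting enqueue/dequeue pattern is:

struct entity { long weight; };
struct queue  { int nr_running; };

/*
 * Stands in for update_cfs_group(): expensive because it reads the
 * shared tg->load_avg to recompute the group entity's weight. The
 * body is elided in this sketch.
 */
static void update_group_weight(struct queue *q, struct entity *se)
{
	(void)q;
	(void)se;
}

static void enqueue(struct queue *q, struct entity *se)
{
	/*
	 * Only recompute the weight when another entity is already
	 * queued: with nobody to preempt, an up-to-date weight is not
	 * needed yet and a later update will correct it.
	 */
	if (q->nr_running > 0)
		update_group_weight(q, se);
	q->nr_running++;
}

static void dequeue(struct queue *q, struct entity *se)
{
	(void)se;
	/* No weight update: @se is leaving, so the result is unused. */
	q->nr_running--;
}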
Other test results:

==================================================
postgres_sysbench (transactions, higher is better)
nr_thread=100%/75%/50% were tested on 2-socket SPR and Ice Lake and
results that have a measurable difference are:

nr_thread=100% on SPR:
base:    90569.11±1.15%
node:   104152.26±0.34%  +15.0%
delay:  127309.46±4.25%  +40.6%
skip:   125501.96±1.83%  +38.6%

nr_thread=75% on SPR:
base:   100803.96±0.57%
node:   107333.58±0.44%   +6.5%
delay:  124332.39±0.51%  +23.3%
skip:   127676.55±0.03%  +26.7%

nr_thread=75% on ICL:
base:   61961.26±0.41%
node:   61585.45±0.50%
delay:  72420.52±0.14%  +16.9%
skip:   72413.23±0.30%  +16.9%

==================================================
hackbench/pipe/threads/fd=20/loop=1000000 (throughput, higher is better)
group=1/4/8/16 were tested on 2-socket SPR and Cascade Lake and the
results that have a measurable difference are:

group=8 on SPR:
base:   437163±2.6%
node:   471203±1.2%   +7.8%
delay:  490780±0.9%  +12.3%
skip:   493062±1.9%  +12.8%

group=16 on SPR:
base:   468279±1.9%
node:   580385±1.7%  +23.9%
delay:  664422±0.2%  +41.9%
skip:   697387±0.2%  +48.9%

==================================================
netperf/TCP_STREAM (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2-socket SPR and Cascade
Lake and results that have a measurable difference are:

nr_thread=50% on SPR:
base:   16258±0.7%
node:   16172±2.9%
delay:  17729±0.7%   +9.0%
skip:   17823±1.3%   +9.6%

nr_thread=75% on CSL:
base:   12923±1.2%
node:   13011±2.2%
delay:  15452±1.6%  +19.6%
skip:   15302±1.7%  +18.4%

nr_thread=75% on SPR:
base:   16232±11.9%
node:   13962±5.1%
delay:  21089±0.8%  +29.9%
skip:   21251±0.4%  +30.9%

nr_thread=100% on SPR:
base:   13220±0.6%
node:   13113±0.0%
delay:  18258±11.3%  +38.1%
skip:   16974±12.7%  +28.4%

==================================================
netperf/UDP_RR (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2-socket SPR and Cascade
Lake and results that have a measurable difference are:

nr_thread=25% on CSL:
base:   107269±0.3%
node:   107226±0.2%
delay:  106978±0.3%
skip:   109652±0.3%   +2.2%

nr_thread=50% on CSL:
base:   74854±0.1%
node:   74521±0.4%
delay:  74438±0.2%
skip:   76431±0.1%   +2.1%

nr_thread=75% on CSL:
base:   36701±1.7%
node:   39949±1.4%   +8.8%
delay:  42516±0.3%  +15.8%
skip:   45044±0.5%  +22.7%

nr_thread=75% on SPR:
base:   14249±3.8%
node:   19890±2.0%   +39.6%
delay:  31331±0.5%  +119.9%
skip:   33688±3.5%  +136.4%

nr_thread=100% on CSL:
base:   52275±0.6%
node:   53827±0.4%   +3.0%
delay:  78386±0.7%  +49.9%
skip:   76926±2.3%  +47.2%

nr_thread=100% on SPR:
base:    9560±1.6%
node:   14186±3.9%   +48.4%
delay:  20779±2.8%  +117.4%
skip:   32125±2.5%  +236.0%

Signed-off-by: Aaron Lu
---
 kernel/sched/fair.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 564ffe3e59c1..0dbbb92302ad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4862,7 +4862,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 */
 	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
 	se_update_runnable(se);
-	update_cfs_group(se);
+	if (cfs_rq->nr_running > 0)
+		update_cfs_group(se);
 	account_entity_enqueue(cfs_rq, se);
 
 	if (flags & ENQUEUE_WAKEUP)
@@ -4978,8 +4979,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/* return excess runtime on last dequeue */
 	return_cfs_rq_runtime(cfs_rq);
 
-	update_cfs_group(se);
-
 	/*
 	 * Now advance min_vruntime if @se was the entity holding it back,
 	 * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
-- 
2.41.0