From: Chen Yu
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
    K Prateek Nayak, Gautham R. Shenoy, Chen Yu, Aaron Lu,
    linux-kernel@vger.kernel.org
Subject: [RFC PATCH 1/7] sched/topology: Assign sd_share for all non-NUMA sched domains
Date: Thu, 27 Jul 2023 22:34:22 +0800
Message-Id: <169500eaa13198382765027eb047e6c7a0e5a13e.1690273854.git.yu.c.chen@intel.com>

Currently, only a domain with the SD_SHARE_PKG_RESOURCES flag set shares
one sd_share instance among all CPUs in that domain. Remove this
restriction and extend sd_share to every sched domain below the NUMA
level. The shared field will be used by a later patch that optimizes
newidle balancing.
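As a rough illustration (a hypothetical debug helper, not part of this
series, assuming kernel/sched context): after this change a walk up a
CPU's domain hierarchy finds a valid sd->shared at every level below
NUMA, so a sketch like the following would print one line per non-NUMA
domain:

static void dump_sd_shared(int cpu)
{
    struct sched_domain *sd;

    rcu_read_lock();
    for_each_domain(cpu, sd) {
        if (sd->flags & SD_NUMA)
            break;
        /* sd->shared is now populated for every non-NUMA level */
        pr_info("cpu%d level%d: ref=%d nr_busy_cpus=%d\n",
            cpu, sd->level, atomic_read(&sd->shared->ref),
            atomic_read(&sd->shared->nr_busy_cpus));
    }
    rcu_read_unlock();
}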
Suggested-by: Gautham R. Shenoy
Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Chen Yu
---
 kernel/sched/topology.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d3a3b2646ec4..64212f514765 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1641,10 +1641,10 @@ sd_init(struct sched_domain_topology_level *tl,
 	}

 	/*
-	 * For all levels sharing cache; connect a sched_domain_shared
+	 * For all levels except for NUMA; connect a sched_domain_shared
 	 * instance.
 	 */
-	if (sd->flags & SD_SHARE_PKG_RESOURCES) {
+	if (!(sd->flags & SD_NUMA)) {
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
-- 
2.25.1


From: Chen Yu
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
    K Prateek Nayak, Gautham R. Shenoy, Chen Yu, Aaron Lu,
    linux-kernel@vger.kernel.org
Subject: [RFC PATCH 2/7] sched/topology: Introduce nr_groups in sched_domain to indicate the number of groups
Date: Thu, 27 Jul 2023 22:34:36 +0800

Record the number of sched groups within each sched domain, in
preparation for the newidle_balance() scan-depth calculation introduced
later by ILB_UTIL.
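For intuition (illustrative sketch only, not part of the patch): the
sched groups of a domain hang off a circular singly linked list, so
without the new field the group count would have to be recomputed by a
walk like the one below each time it is needed; sd->nr_groups simply
caches the result of this walk at domain build time.

static unsigned int count_sched_groups(struct sched_domain *sd)
{
    struct sched_group *sg = sd->groups;
    unsigned int n = 0;

    /* sched groups form a circular list; walk until we wrap around */
    do {
        n++;
        sg = sg->next;
    } while (sg != sd->groups);

    return n;
}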
Signed-off-by: Chen Yu
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/topology.c        | 10 ++++++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 67b573d5bf28..c07f2f00317a 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -152,6 +152,7 @@ struct sched_domain {
 	struct sched_domain_shared *shared;

 	unsigned int span_weight;
+	unsigned int nr_groups;
 	/*
 	 * Span of all CPUs in this domain.
 	 *
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 64212f514765..56dc564fc9a3 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1023,7 +1023,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 	struct cpumask *covered = sched_domains_tmpmask;
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
-	int i;
+	int i, nr_groups = 0;

 	cpumask_clear(covered);

@@ -1087,6 +1087,8 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		if (!sg)
 			goto fail;

+		nr_groups++;
+
 		sg_span = sched_group_span(sg);
 		cpumask_or(covered, covered, sg_span);

@@ -1100,6 +1102,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		last->next = first;
 	}
 	sd->groups = first;
+	sd->nr_groups = nr_groups;

 	return 0;

@@ -1233,7 +1236,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 	struct sd_data *sdd = sd->private;
 	const struct cpumask *span = sched_domain_span(sd);
 	struct cpumask *covered;
-	int i;
+	int i, nr_groups = 0;

 	lockdep_assert_held(&sched_domains_mutex);
 	covered = sched_domains_tmpmask;
@@ -1248,6 +1251,8 @@ build_sched_groups(struct sched_domain *sd, int cpu)

 		sg = get_group(i, sdd);

+		nr_groups++;
+
 		cpumask_or(covered, covered, sched_group_span(sg));

 		if (!first)
@@ -1258,6 +1263,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 	}
 	last->next = first;
 	sd->groups = first;
+	sd->nr_groups = nr_groups;

 	return 0;
 }
-- 
2.25.1


From: Chen Yu
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
    K Prateek Nayak, Gautham R. Shenoy, Chen Yu, Aaron Lu,
    linux-kernel@vger.kernel.org
Subject: [RFC PATCH 3/7] sched/fair: Save a snapshot of sched domain total_load and total_capacity
Date: Thu, 27 Jul 2023 22:34:50 +0800
Message-Id: <0d71de8648889fe8b202be376e97d581ff3f12ed.1690273854.git.yu.c.chen@intel.com>

Save the total_load and total_capacity of the current sched domain
during each periodic load balance. These statistics can later be reused
by a CPU_NEWLY_IDLE load balance that quits the group scan early.
Introduce a sched feature, ILB_SNAPSHOT, to control this.

Readers can check whether sd_share->total_capacity is non-zero to
verify that the snapshot is valid. In theory, once the system has
reached a stable state, total_capacity and total_load should not change
dramatically.
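For illustration (a sketch only; the real consumer of the snapshot is
added by a later patch of this series), a reader could look like:

static bool ilb_load_snapshot(struct sched_domain_shared *sd_share,
                  unsigned long *load, unsigned long *capacity)
{
    /* a zero total_capacity means no periodic balance published yet */
    if (!sd_share || !READ_ONCE(sd_share->total_capacity))
        return false;

    *load = READ_ONCE(sd_share->total_load);
    *capacity = READ_ONCE(sd_share->total_capacity);
    return true;
}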
Signed-off-by: Chen Yu
---
 include/linux/sched/topology.h |  2 ++
 kernel/sched/fair.c            | 25 +++++++++++++++++++++++++
 kernel/sched/features.h        |  2 ++
 3 files changed, 29 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index c07f2f00317a..d6a64a2c92aa 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -82,6 +82,8 @@ struct sched_domain_shared {
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
 	int		nr_idle_scan;
+	unsigned long	total_load;
+	unsigned long	total_capacity;
 };

 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3e25be58e2b..edcfee9965cd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10132,6 +10132,27 @@ static void update_idle_cpu_scan(struct lb_env *env,
 	WRITE_ONCE(sd_share->nr_idle_scan, (int)y);
 }

+static void ilb_save_stats(struct lb_env *env,
+			   struct sched_domain_shared *sd_share,
+			   struct sd_lb_stats *sds)
+{
+	if (!sched_feat(ILB_SNAPSHOT))
+		return;
+
+	if (!sd_share)
+		return;
+
+	/* newidle balance is too frequent */
+	if (env->idle == CPU_NEWLY_IDLE)
+		return;
+
+	if (sds->total_load != sd_share->total_load)
+		WRITE_ONCE(sd_share->total_load, sds->total_load);
+
+	if (sds->total_capacity != sd_share->total_capacity)
+		WRITE_ONCE(sd_share->total_capacity, sds->total_capacity);
+}
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -10140,6 +10161,7 @@ static void update_idle_cpu_scan(struct lb_env *env,

 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
+	struct sched_domain_shared *sd_share = env->sd->shared;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
@@ -10209,6 +10231,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}

 	update_idle_cpu_scan(env, sum_util);
+
+	/* save a snapshot of stats during periodic load balance */
+	ilb_save_stats(env, sd_share, sds);
 }

 /**
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..3cb71c8cddc0 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -101,3 +101,5 @@ SCHED_FEAT(LATENCY_WARN, false)

 SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)
+
+SCHED_FEAT(ILB_SNAPSHOT, true)
-- 
2.25.1


From: Chen Yu
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
    K Prateek Nayak, Gautham R. Shenoy, Chen Yu, Aaron Lu,
    linux-kernel@vger.kernel.org
Subject: [RFC PATCH 4/7] sched/fair: Calculate the scan depth for idle balance based on system utilization
Date: Thu, 27 Jul 2023 22:35:02 +0800
Message-Id: <61e6fce60ca738215b6e5ad9033fb692c3a8fbb1.1690273854.git.yu.c.chen@intel.com>

When a CPU is about to enter idle, it invokes newidle_balance() to pull
some tasks from other runqueues. Although the per-domain
max_newidle_lb_cost throttles newidle_balance(), it is worth further
limiting the scan based on overall system utilization. The reason is
that nothing prevents newidle_balance() from being launched
simultaneously on multiple CPUs, and each invocation has to traverse
all the groups to calculate their statistics one by one; with n groups,
up to n CPUs may each scan n groups at once, so the total time spent in
newidle_balance() can reach O(n^2). The issue is more severe when there
are many groups within one domain, for example on a system with a large
number of cores in an LLC domain, and it is bad for both performance
and power saving.

sqlite spends quite some time in newidle balance on Intel Sapphire
Rapids, which has 2 x 56C/112T = 224 CPUs:

   6.69%  0.09%  sqlite3  [kernel.kallsyms]  [k] newidle_balance
   5.39%  4.71%  sqlite3  [kernel.kallsyms]  [k] update_sd_lb_stats

Based on this observation, limit the scan depth of newidle_balance()
by considering the utilization of the sched domain. Let the number of
scanned groups be a linear function of the utilization ratio:

   nr_groups_to_scan = nr_groups * (1 - util_ratio)
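To make the formula concrete, here is a standalone userspace sketch
with hypothetical numbers; the arithmetic mirrors the do_div() sequence
in the patch below:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024ULL

static unsigned int groups_to_scan(unsigned int nr_groups,
                   unsigned int span_weight,
                   unsigned long long sum_util)
{
    /* nr_scan = nr_groups * sum_util / (span_weight * SCHED_CAPACITY_SCALE) */
    unsigned long long nr_scan = nr_groups * sum_util;

    nr_scan /= span_weight * SCHED_CAPACITY_SCALE;
    return nr_groups - (unsigned int)nr_scan;
}

int main(void)
{
    /* e.g. 56 groups, 112 CPUs, each CPU about 75% utilized */
    printf("%u\n", groups_to_scan(56, 112, 112ULL * 768));  /* prints 14 */
    return 0;
}

That is, at roughly 75% utilization only 14 of the 56 groups would be
scanned, while at 0% utilization all 56 would be.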
Suggested-by: Tim Chen
Signed-off-by: Chen Yu
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 30 ++++++++++++++++++++++++++++++
 kernel/sched/features.h        |  1 +
 3 files changed, 32 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index d6a64a2c92aa..af2261308529 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -84,6 +84,7 @@ struct sched_domain_shared {
 	int		nr_idle_scan;
 	unsigned long	total_load;
 	unsigned long	total_capacity;
+	int		nr_sg_scan;
 };

 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index edcfee9965cd..6925813db59b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10153,6 +10153,35 @@ static void ilb_save_stats(struct lb_env *env,
 		WRITE_ONCE(sd_share->total_capacity, sds->total_capacity);
 }

+static void update_ilb_group_scan(struct lb_env *env,
+				  unsigned long sum_util,
+				  struct sched_domain_shared *sd_share)
+{
+	u64 tmp, nr_scan;
+
+	if (!sched_feat(ILB_UTIL))
+		return;
+
+	if (!sd_share)
+		return;
+
+	if (env->idle == CPU_NEWLY_IDLE)
+		return;
+
+	/*
+	 * Limit the newidle balance scan depth based on overall system
+	 * utilization:
+	 * nr_groups_scan = nr_groups * (1 - util_ratio)
+	 * and util_ratio = sum_util / (sd_weight * SCHED_CAPACITY_SCALE)
+	 */
+	nr_scan = env->sd->nr_groups * sum_util;
+	tmp = env->sd->span_weight * SCHED_CAPACITY_SCALE;
+	do_div(nr_scan, tmp);
+	nr_scan = env->sd->nr_groups - nr_scan;
+	if ((int)nr_scan != sd_share->nr_sg_scan)
+		WRITE_ONCE(sd_share->nr_sg_scan, (int)nr_scan);
+}
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -10231,6 +10260,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}

 	update_idle_cpu_scan(env, sum_util);
+	update_ilb_group_scan(env, sum_util, sd_share);

 	/* save a snapshot of stats during periodic load balance */
 	ilb_save_stats(env, sd_share, sds);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3cb71c8cddc0..30f6d1a2f235 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -103,3 +103,4 @@ SCHED_FEAT(ALT_PERIOD, true)
 SCHED_FEAT(BASE_SLICE, true)

 SCHED_FEAT(ILB_SNAPSHOT, true)
+SCHED_FEAT(ILB_UTIL, true)
-- 
2.25.1


From: Chen Yu
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
    K Prateek Nayak, Gautham R. Shenoy, Chen Yu, Aaron Lu,
    linux-kernel@vger.kernel.org
Subject: [RFC PATCH 5/7] sched/fair: Adjust the busiest group scanning depth in idle load balance
Date: Thu, 27 Jul 2023 22:35:13 +0800
Message-Id: <98e26a26832669b4293a50a701f9b3b8d44e4863.1690273854.git.yu.c.chen@intel.com>

Scanning the whole sched domain to find the busiest group is costly
during newidle_balance(), yet when a CPU becomes idle it should pull
some tasks from other CPUs as quickly as possible.
Limit newidle_balance() to scanning only a bounded number of sched
groups, looking for a relatively busy group to pull from. In short, the
more spare capacity there is in the domain, the more time each newidle
balance may spend on scanning for a busy group. Although newidle
balance already has the per-domain max_newidle_lb_cost to decide
whether to launch the balance at all, ILB_UTIL provides a finer
granularity: it decides how many groups each newidle balance may scan.
The scan depth is calculated by the previous periodic load balance,
based on its overall utilization.

Tested on top of v6.5-rc2, on Sapphire Rapids with 2 x 56C/112T = 224
CPUs, with the cpufreq governor set to performance and C6 disabled.

First, an extreme synthetic test[1], which launches 224 processes, each
a loop of nanosleep(1 us), intended to trigger newidle balance as much
as possible:

   i=1;while [ $i -le "224" ]; do ./nano_sleep 1000 & i=$(($i+1)); done;

NO_ILB_UTIL + ILB_SNAPSHOT:
   9.38%  0.45%  [kernel.kallsyms]  [k] newidle_balance
   6.84%  5.32%  [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0

ILB_UTIL + ILB_SNAPSHOT:
   3.35%  0.38%  [kernel.kallsyms]  [k] newidle_balance
   2.30%  1.81%  [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0

With ILB_UTIL enabled, the total number of newidle_balance() and
update_sd_lb_stats() invocations drops, although why there are fewer
newidle balances has not been investigated yet. Given the low util_avg
values in /sys/kernel/debug/sched/debug, there should not be much
impact on the nanosleep stress test itself.

Tests in a wider range:

[netperf]
Launches nr instances of:
   netperf -4 -H 127.0.0.1 -t $work_mode -c -C -l 100 &
nr: 56, 112, 168, 224, 280, 336, 392, 448
work_mode: TCP_RR UDP_RR

throughput
==========
case            load            baseline(std%)  compare%( std%)
TCP_RR          56-threads       1.00 (  5.15)   -3.96 (  2.17)
TCP_RR          112-threads      1.00 (  2.84)   -0.82 (  2.24)
TCP_RR          168-threads      1.00 (  2.11)   -0.03 (  2.31)
TCP_RR          224-threads      1.00 (  1.76)   +0.01 (  2.12)
TCP_RR          280-threads      1.00 ( 62.46)  +56.56 ( 56.91)
TCP_RR          336-threads      1.00 ( 19.81)   +0.27 ( 17.90)
TCP_RR          392-threads      1.00 ( 30.85)   +0.13 ( 29.09)
TCP_RR          448-threads      1.00 ( 39.71)  -18.82 ( 45.93)
UDP_RR          56-threads       1.00 (  2.08)   -0.31 (  7.89)
UDP_RR          112-threads      1.00 (  3.22)   -0.50 ( 15.19)
UDP_RR          168-threads      1.00 ( 11.77)   +0.37 ( 10.30)
UDP_RR          224-threads      1.00 ( 14.03)   +0.25 ( 12.88)
UDP_RR          280-threads      1.00 ( 16.83)   -0.57 ( 15.34)
UDP_RR          336-threads      1.00 ( 22.57)   +0.01 ( 24.68)
UDP_RR          392-threads      1.00 ( 33.89)   +2.65 ( 33.89)
UDP_RR          448-threads      1.00 ( 44.18)   +0.81 ( 41.28)

Considering the std%, there is not much difference for netperf.

[tbench]
   tbench -t 100 $job 127.0.0.1
job: 56, 112, 168, 224, 280, 336, 392, 448

throughput
==========
case            load            baseline(std%)  compare%( std%)
loopback        56-threads       1.00 (  2.20)   -0.09 (  2.05)
loopback        112-threads      1.00 (  0.29)   -0.88 (  0.10)
loopback        168-threads      1.00 (  0.02)  +62.92 ( 54.57)
loopback        224-threads      1.00 (  0.05) +234.30 (  1.81)
loopback        280-threads      1.00 (  0.08)   -0.11 (  0.21)
loopback        336-threads      1.00 (  0.17)   -0.17 (  0.08)
loopback        392-threads      1.00 (  0.14)   -0.09 (  0.18)
loopback        448-threads      1.00 (  0.24)   -0.53 (  0.55)

There is an improvement for tbench in the 224-threads case.
[hackbench]
hackbench -g $job --$work_type --pipe -l 200000 -s 100 -f 28
and
hackbench -g $job --$work_type -l 200000 -s 100 -f 28
job: 1, 2, 4, 8
work_type: process threads

throughput
==========
case            load            baseline(std%)  compare%( std%)
process-pipe        1-groups     1.00 (  0.20)   +1.57 (  0.58)
process-pipe        2-groups     1.00 (  3.53)   +2.99 (  2.03)
process-pipe        4-groups     1.00 (  1.07)   +0.17 (  1.64)
process-sockets     1-groups     1.00 (  0.36)   -0.04 (  1.44)
process-sockets     2-groups     1.00 (  0.84)   +0.65 (  1.65)
process-sockets     4-groups     1.00 (  0.04)   +0.89 (  0.08)
threads-pipe        1-groups     1.00 (  3.62)   -0.53 (  1.67)
threads-pipe        2-groups     1.00 (  4.17)   -4.79 (  0.53)
threads-pipe        4-groups     1.00 (  5.30)   +5.06 (  1.95)
threads-sockets     1-groups     1.00 (  0.40)   +1.44 (  0.53)
threads-sockets     2-groups     1.00 (  2.54)   +2.21 (  2.51)
threads-sockets     4-groups     1.00 (  0.05)   +1.29 (  0.05)

Not much difference for hackbench.

[schbench(old)]
schbench -m $job -t 56 -r 30
job: 1, 2, 4, 8
3 iterations

99.0th latency
==============
case            load            baseline(std%)  compare%( std%)
normal          1-mthreads       1.00 (  0.56)   -0.91 (  0.32)
normal          2-mthreads       1.00 (  0.95)   -4.05 (  3.63)
normal          4-mthreads       1.00 (  4.04)   -0.30 (  2.35)

Not much difference for schbench.

[Limitation]
In a previous version, Prateek reported a regression. That could be due
to concurrent access across NUMA nodes, or to ILB_UTIL not scanning
hard enough to pull from the busiest group. The former issue is fixed
by not enabling ILB_UTIL for the NUMA domain. If a regression remains
in this version, we can leverage the result of SIS_UTIL and use a
quadratic function rather than the linear one, to scan harder when the
system is idle.

Link: https://raw.githubusercontent.com/chen-yu-surf/tools/master/stress_nanosleep.c #1
Suggested-by: Tim Chen
Signed-off-by: Chen Yu
---
 kernel/sched/fair.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6925813db59b..4e360ed16e14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10195,7 +10195,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
 	unsigned long sum_util = 0;
-	int sg_status = 0;
+	int sg_status = 0, nr_sg_scan;
+	/* only newidle CPU can load the snapshot */
+	bool ilb_can_load = env->idle == CPU_NEWLY_IDLE &&
+			    sd_share && READ_ONCE(sd_share->total_capacity);
+
+	if (sched_feat(ILB_UTIL) && ilb_can_load)
+		nr_sg_scan = sd_share->nr_sg_scan;

 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
@@ -10222,6 +10228,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 			sds->busiest_stat = *sgs;
 		}

+		if (sched_feat(ILB_UTIL) && ilb_can_load && --nr_sg_scan <= 0)
+			goto load_snapshot;
+
 next_group:
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
@@ -10231,6 +10240,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		sg = sg->next;
 	} while (sg != env->sd->groups);

+	ilb_can_load = false;
+
+load_snapshot:
+	if (ilb_can_load) {
+		/* borrow the statistic of previous periodic load balance */
+		sds->total_load = READ_ONCE(sd_share->total_load);
+		sds->total_capacity = READ_ONCE(sd_share->total_capacity);
+	}
+
 	/*
 	 * Indicate that the child domain of the busiest group prefers tasks
 	 * go to a child's sibling domains first. NB the flags of a sched group
-- 
2.25.1


From: Chen Yu
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
    K Prateek Nayak, Gautham R. Shenoy, Chen Yu, Aaron Lu,
    linux-kernel@vger.kernel.org
Subject: [RFC PATCH 6/7] sched/fair: Pull from a relatively busy group during newidle balance
Date: Thu, 27 Jul 2023 22:35:24 +0800

Scanning the whole sched domain to find the busiest group is costly
during newidle_balance() on a high core count system. Introduce
ILB_FAST to lower the bar during the busiest group scan: if the
candidate sched group is already relatively busier than the local
group, terminate the scan and try to pull from that group directly.

Comparing ILB_UTIL and ILB_FAST: the former inhibits the sched group
scan when the system is busy, while the latter settles for a
good-enough busy group when the system is not busy. They complement
each other and work independently.

Tested on top of v6.5-rc2, on Sapphire Rapids with 2 x 56C/112T = 224
CPUs, with the cpufreq governor set to performance and C6 disabled.

First, an extreme synthetic test[1] borrowed from Tianyou, which
launches 224 processes.
Each process is a loop of nanosleep(1 us), intended to trigger newidle
balance frequently:

   i=1;while [ $i -le "224" ]; do ./nano_sleep 1000 & i=$(($i+1)); done;

[ILB_SNAPSHOT + NO_ILB_UTIL + NO_ILB_FAST]
Check the /proc/schedstat delta on CPU8 over 5 seconds using the
following script[2]:
   schedstat.py -i 5 -c 8

Mon Jul 24 23:43:43 2023                     cpu8
.domain0.CPU_IDLE.lb_balanced                843
.domain0.CPU_IDLE.lb_count                   843
.domain0.CPU_IDLE.lb_nobusyg                 843
.domain0.CPU_IDLE.lb_sg_scan                 843
.domain0.CPU_NEWLY_IDLE.lb_balanced          836
.domain0.CPU_NEWLY_IDLE.lb_count             837
.domain0.CPU_NEWLY_IDLE.lb_gained            1
.domain0.CPU_NEWLY_IDLE.lb_imbalance         1
.domain0.CPU_NEWLY_IDLE.lb_nobusyg           836
.domain0.CPU_NEWLY_IDLE.lb_sg_scan           837
.domain1.CPU_IDLE.lb_balanced                41
.domain1.CPU_IDLE.lb_count                   41
.domain1.CPU_IDLE.lb_nobusyg                 39
.domain1.CPU_IDLE.lb_sg_scan                 2145
.domain1.CPU_NEWLY_IDLE.lb_balanced          732    <-----
.domain1.CPU_NEWLY_IDLE.lb_count             822    <-----
.domain1.CPU_NEWLY_IDLE.lb_failed            90
.domain1.CPU_NEWLY_IDLE.lb_imbalance         90
.domain1.CPU_NEWLY_IDLE.lb_nobusyg           497
.domain1.CPU_NEWLY_IDLE.lb_nobusyq           235
.domain1.CPU_NEWLY_IDLE.lb_sg_scan           45210  <-----
.domain1.ttwu_wake_remote                    626
.domain2.CPU_IDLE.lb_balanced                15
.domain2.CPU_IDLE.lb_count                   15
.domain2.CPU_NEWLY_IDLE.lb_balanced          635
.domain2.CPU_NEWLY_IDLE.lb_count             655
.domain2.CPU_NEWLY_IDLE.lb_failed            20
.domain2.CPU_NEWLY_IDLE.lb_imbalance         40
.domain2.CPU_NEWLY_IDLE.lb_nobusyg           633
.domain2.CPU_NEWLY_IDLE.lb_nobusyq           2
.domain2.CPU_NEWLY_IDLE.lb_sg_scan           655
.stats.rq_cpu_time                           227910772
.stats.rq_sched_info.pcount                  89393
.stats.rq_sched_info.run_delay               2145671
.stats.sched_count                           178783
.stats.sched_goidle                          89390
.stats.ttwu_count                            89392
.stats.ttwu_local                            88766

For domain1, there are 822 newidle balance attempts, and the total
number of groups scanned is 45210, so each balance scans about 55
groups.
During these 822 balance attempts, 732 end up (or already are)
balanced, so the effective balance success ratio is
(822 - 732) / 822 = 10.94%.

The perf profile:
   9.38%  0.45%  [kernel.kallsyms]  [k] newidle_balance
   6.84%  5.32%  [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0

[ILB_SNAPSHOT + NO_ILB_UTIL + ILB_FAST]
Mon Jul 24 23:43:50 2023                     cpu8
.domain0.CPU_IDLE.lb_balanced                918
.domain0.CPU_IDLE.lb_count                   918
.domain0.CPU_IDLE.lb_nobusyg                 918
.domain0.CPU_IDLE.lb_sg_scan                 918
.domain0.CPU_NEWLY_IDLE.lb_balanced          1536
.domain0.CPU_NEWLY_IDLE.lb_count             1545
.domain0.CPU_NEWLY_IDLE.lb_failed            1
.domain0.CPU_NEWLY_IDLE.lb_gained            8
.domain0.CPU_NEWLY_IDLE.lb_imbalance         9
.domain0.CPU_NEWLY_IDLE.lb_nobusyg           1536
.domain0.CPU_NEWLY_IDLE.lb_sg_scan           1545
.domain1.CPU_IDLE.lb_balanced                45
.domain1.CPU_IDLE.lb_count                   45
.domain1.CPU_IDLE.lb_nobusyg                 43
.domain1.CPU_IDLE.lb_sg_scan                 2365
.domain1.CPU_NEWLY_IDLE.lb_balanced          1196   <------
.domain1.CPU_NEWLY_IDLE.lb_count             1496   <------
.domain1.CPU_NEWLY_IDLE.lb_failed            296
.domain1.CPU_NEWLY_IDLE.lb_gained            4
.domain1.CPU_NEWLY_IDLE.lb_imbalance         301
.domain1.CPU_NEWLY_IDLE.lb_nobusyg           1182
.domain1.CPU_NEWLY_IDLE.lb_nobusyq           14
.domain1.CPU_NEWLY_IDLE.lb_sg_scan           30127  <------
.domain1.ttwu_wake_remote                    2688
.domain2.CPU_IDLE.lb_balanced                13
.domain2.CPU_IDLE.lb_count                   13
.domain2.CPU_NEWLY_IDLE.lb_balanced          898
.domain2.CPU_NEWLY_IDLE.lb_count             904
.domain2.CPU_NEWLY_IDLE.lb_failed            6
.domain2.CPU_NEWLY_IDLE.lb_imbalance         11
.domain2.CPU_NEWLY_IDLE.lb_nobusyg           896
.domain2.CPU_NEWLY_IDLE.lb_nobusyq           2
.domain2.CPU_NEWLY_IDLE.lb_sg_scan           904
.stats.rq_cpu_time                           239830575
.stats.rq_sched_info.pcount                  90879
.stats.rq_sched_info.run_delay               2436461
.stats.sched_count                           181732
.stats.sched_goidle                          90853
.stats.ttwu_count                            90880
.stats.ttwu_local                            88192

With ILB_FAST enabled, the CPU_NEWLY_IDLE balance count in domain1 on
CPU8 is 1496 and the total number of groups scanned is 30127, i.e. each
load balance scans about 20 groups, well below the 56 groups in the
domain. During these 1496 balance attempts, 1196 are balanced, so the
effective balance success ratio is (1496 - 1196) / 1496 = 20.05%,
higher than the 10.94% with ILB_FAST disabled.

perf profile:
   2.95%  0.38%  [kernel.kallsyms]  [k] newidle_balance
   2.00%  1.51%  [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0

With ILB_FAST enabled, the total time in update_sd_lb_stats() has
dropped a lot. More benchmark results are shown below.
Baseline is ILB_SNAPSHOT + NO_ILB_UTIL, compared against
ILB_SNAPSHOT + NO_ILB_UTIL + ILB_FAST.

[netperf]
Launches nr instances of:
   netperf -4 -H 127.0.0.1 -t $work_mode -c -C -l 100 &
nr: 56, 112, 168, 224, 280, 336, 392, 448
work_mode: TCP_RR UDP_RR

throughput
==========
case            load            baseline(std%)  compare%( std%)
TCP_RR          56-threads       1.00 (  1.83)   +4.25 (  5.15)
TCP_RR          112-threads      1.00 (  2.19)   +0.96 (  2.84)
TCP_RR          168-threads      1.00 (  1.92)   -0.04 (  2.11)
TCP_RR          224-threads      1.00 (  1.98)   -0.03 (  1.76)
TCP_RR          280-threads      1.00 ( 63.11)   -7.59 ( 62.46)
TCP_RR          336-threads      1.00 ( 18.44)   -0.45 ( 19.81)
TCP_RR          392-threads      1.00 ( 26.49)   -0.09 ( 30.85)
TCP_RR          448-threads      1.00 ( 40.47)   -0.28 ( 39.71)
UDP_RR          56-threads       1.00 (  1.83)   -0.31 (  2.08)
UDP_RR          112-threads      1.00 ( 13.77)   +3.58 (  3.22)
UDP_RR          168-threads      1.00 ( 10.97)   -0.08 ( 11.77)
UDP_RR          224-threads      1.00 ( 12.83)   -0.04 ( 14.03)
UDP_RR          280-threads      1.00 ( 13.89)   +0.35 ( 16.83)
UDP_RR          336-threads      1.00 ( 24.91)   +1.38 ( 22.57)
UDP_RR          392-threads      1.00 ( 34.86)   -0.91 ( 33.89)
UDP_RR          448-threads      1.00 ( 40.63)   +0.70 ( 44.18)

[tbench]
   tbench -t 100 $job 127.0.0.1
job: 56, 112, 168, 224, 280, 336, 392, 448

throughput
==========
case            load            baseline(std%)  compare%( std%)
loopback        56-threads       1.00 (  0.89)   +1.51 (  2.20)
loopback        112-threads      1.00 (  0.03)   +1.15 (  0.29)
loopback        168-threads      1.00 ( 53.55)  -37.92 (  0.02)
loopback        224-threads      1.00 ( 61.24)  -43.18 (  0.01)
loopback        280-threads      1.00 (  0.04)   +0.33 (  0.08)
loopback        336-threads      1.00 (  0.35)   +0.40 (  0.17)
loopback        392-threads      1.00 (  0.61)   +0.49 (  0.14)
loopback        448-threads      1.00 (  0.08)   +0.01 (  0.24)

[schbench]
schbench -m $job -t 56 -r 30
job: 1, 2, 4, 8
3 iterations

99.0th latency
==============
case            load            baseline(std%)  compare%( std%)
normal          1-mthreads       1.00 (  0.56)   -0.45 (  0.32)
normal          2-mthreads       1.00 (  0.95)   +1.01 (  3.45)
normal          4-mthreads       1.00 (  4.04)   -0.60 (  1.26)

[hackbench]
hackbench -g $job --$work_type --pipe -l 200000 -s 100 -f 28
and
hackbench -g $job --$work_type -l 200000 -s 100 -f 28
job: 1, 2, 4, 8
work_type: process threads

throughput
==========
case            load            baseline(std%)  compare%( std%)
process-pipe        1-groups     1.00 (  0.20)   +2.30 (  0.26)
process-pipe        2-groups     1.00 (  3.53)   +6.14 (  2.45)
process-pipe        4-groups     1.00 (  1.07)   -4.58 (  2.58)
process-sockets     1-groups     1.00 (  0.36)   +0.75 (  1.22)
process-sockets     2-groups     1.00 (  0.84)   +1.26 (  1.11)
process-sockets     4-groups     1.00 (  0.04)   +0.97 (  0.11)
threads-pipe        1-groups     1.00 (  3.62)   +3.22 (  2.64)
threads-pipe        2-groups     1.00 (  4.17)   +5.85 (  7.53)
threads-pipe        4-groups     1.00 (  5.30)   -4.14 (  5.39)
threads-sockets     1-groups     1.00 (  0.40)   +3.50 (  3.13)
threads-sockets     2-groups     1.00 (  2.54)   +1.79 (  0.80)
threads-sockets     4-groups     1.00 (  0.05)   +1.33 (  0.03)

Considering the std%, there is not much score difference, which
probably indicates that ILB_FAST reduces the cost of newidle balance
without hurting performance.
Link: https://raw.githubusercontent.com/chen-yu-surf/tools/master/stress_nanosleep.c #1
Link: https://raw.githubusercontent.com/chen-yu-surf/tools/master/schedstat.py #2
Suggested-by: Tim Chen
Signed-off-by: Chen Yu
---
 kernel/sched/fair.c     | 37 +++++++++++++++++++++++++++++++++++++
 kernel/sched/features.h |  1 +
 2 files changed, 38 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4e360ed16e14..9af57b5a24dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10182,6 +10182,36 @@ static void update_ilb_group_scan(struct lb_env *env,
 		WRITE_ONCE(sd_share->nr_sg_scan, (int)nr_scan);
 }

+static bool can_pull_busiest(struct sg_lb_stats *local,
+			     struct sg_lb_stats *busiest)
+{
+	/*
+	 * Check if the local group can pull from the 'busiest'
+	 * group directly. When reaching here, update_sd_pick_busiest()
+	 * has already filtered a candidate.
+	 * The scan in newidle load balance on a high core count system
+	 * is costly, thus provide this shortcut to find a relatively busy
+	 * group rather than the busiest one.
+	 *
+	 * Only enable this shortcut when the local group is quite
+	 * idle. This is because the total cost of newidle_balance()
+	 * becomes severe when multiple CPUs fall into idle and launch
+	 * newidle_balance() concurrently. And that usually indicates
+	 * a group_has_spare status.
+	 */
+	if (local->group_type != group_has_spare)
+		return false;
+
+	if (busiest->idle_cpus > local->idle_cpus)
+		return false;
+
+	if (busiest->idle_cpus == local->idle_cpus &&
+	    busiest->sum_nr_running <= local->sum_nr_running)
+		return false;
+
+	return true;
+}
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -10226,6 +10256,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
 			sds->busiest_stat = *sgs;
+			/*
+			 * Check if this busiest group can be pulled by the
+			 * local group directly.
+			 */
+			if (sched_feat(ILB_FAST) && ilb_can_load &&
+			    can_pull_busiest(local, sgs))
+				goto load_snapshot;
 		}

 		if (sched_feat(ILB_UTIL) && ilb_can_load && --nr_sg_scan <= 0)
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 30f6d1a2f235..4d67e0abb78c 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -104,3 +104,4 @@ SCHED_FEAT(BASE_SLICE, true)

 SCHED_FEAT(ILB_SNAPSHOT, true)
 SCHED_FEAT(ILB_UTIL, true)
+SCHED_FEAT(ILB_FAST, true)
-- 
2.25.1


From: Chen Yu
To: Peter Zijlstra, Vincent Guittot
Cc: Ingo Molnar, Juri Lelli, Tim Chen, Mel Gorman, Dietmar Eggemann,
    K Prateek Nayak, Gautham R. Shenoy, Chen Yu, Aaron Lu,
    linux-kernel@vger.kernel.org
Subject: [RFC PATCH 7/7] sched/stats: Track the number of groups scanned during load balance
Date: Thu, 27 Jul 2023 22:35:36 +0800

Record in schedstat how many sched groups each load balance attempt
scans. This metric can be used to evaluate load balance cost and
efficiency.
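For example (an illustrative userspace sketch; lb_sg_scan is appended
as the 9th per-idle-type field by the show_schedstat() change below),
the average scan depth per balance attempt falls out of two fields:

#include <stdio.h>

int main(void)
{
    /* figures from the domain1 CPU_NEWLY_IDLE snippet in the previous patch */
    unsigned long lb_count = 822, lb_sg_scan = 45210;

    printf("avg groups scanned per balance: %lu\n",
           lb_sg_scan / lb_count);  /* prints 55 */
    return 0;
}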
Signed-off-by: Chen Yu
---
 include/linux/sched/topology.h | 1 +
 kernel/sched/fair.c            | 2 ++
 kernel/sched/stats.c           | 5 +++--
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index af2261308529..fa8fc6a497fd 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -124,6 +124,7 @@ struct sched_domain {
 	unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_nobusyq[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_sg_scan[CPU_MAX_IDLE_TYPES];

 	/* Active load balancing */
 	unsigned int alb_count;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9af57b5a24dc..96df7c5706d1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10253,6 +10253,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 			goto next_group;


+		schedstat_inc(env->sd->lb_sg_scan[env->idle]);
+
 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
 			sds->busiest_stat = *sgs;
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..38608f791363 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -152,7 +152,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
 				   cpumask_pr_args(sched_domain_span(sd)));
 			for (itype = CPU_IDLE; itype < CPU_MAX_IDLE_TYPES;
 			     itype++) {
-				seq_printf(seq, " %u %u %u %u %u %u %u %u",
+				seq_printf(seq, " %u %u %u %u %u %u %u %u %u",
 					   sd->lb_count[itype],
 					   sd->lb_balanced[itype],
 					   sd->lb_failed[itype],
@@ -160,7 +160,8 @@ static int show_schedstat(struct seq_file *seq, void *v)
 					   sd->lb_gained[itype],
 					   sd->lb_hot_gained[itype],
 					   sd->lb_nobusyq[itype],
-					   sd->lb_nobusyg[itype]);
+					   sd->lb_nobusyg[itype],
+					   sd->lb_sg_scan[itype]);
 			}
 			seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u %u\n",
-- 
2.25.1