From: Chen Yu
To: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli
Cc: Tim Chen, Mel Gorman, Dietmar Eggemann, K Prateek Nayak, Abel Wu,
    Gautham R. Shenoy, Len Brown, Yicong Yang, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 1/4] sched/fair: Extract the function to get the sd_llc_shared
Date: Tue, 13 Jun 2023 00:18:19 +0800
Message-Id: <49789cee643fcef7827d2602af35f1198e8a28d0.1686554037.git.yu.c.chen@intel.com>

Introduce get_llc_shared(), which returns the sd_llc_shared of dst_cpu
when the sched domain being balanced is the LLC domain. Make SIS_UTIL
the first user of this helper, and prepare for its reuse by ILB_UTIL in
a later patch. No functional change intended.
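For reference, a minimal sketch of the intended calling convention (not
part of the patch; it assumes, as for the existing sd_llc_shared users,
that the caller already runs inside an RCU read-side critical section,
which load balancing does):

	struct sched_domain_shared *sd_share = get_llc_shared(env);

	if (!sd_share)
		return;	/* env->sd is not the LLC domain of dst_cpu */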
Signed-off-by: Chen Yu
---
 kernel/sched/fair.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6189d1a45635..b3a24aead848 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10043,10 +10043,21 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+/* Get the LLC shared information of dst CPU if doing balance in LLC */
+static struct sched_domain_shared *get_llc_shared(struct lb_env *env)
+{
+	struct sched_domain_shared *sd_share = NULL;
+
+	if (per_cpu(sd_llc_size, env->dst_cpu) == env->sd->span_weight)
+		sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
+
+	return sd_share;
+}
+
 static void update_idle_cpu_scan(struct lb_env *env,
-				 unsigned long sum_util)
+				 unsigned long sum_util,
+				 struct sched_domain_shared *sd_share)
 {
-	struct sched_domain_shared *sd_share;
 	int llc_weight, pct;
 	u64 x, y, tmp;
 	/*
@@ -10060,14 +10071,11 @@ static void update_idle_cpu_scan(struct lb_env *env,
 	if (!sched_feat(SIS_UTIL) || env->idle == CPU_NEWLY_IDLE)
 		return;
 
-	llc_weight = per_cpu(sd_llc_size, env->dst_cpu);
-	if (env->sd->span_weight != llc_weight)
-		return;
-
-	sd_share = rcu_dereference(per_cpu(sd_llc_shared, env->dst_cpu));
 	if (!sd_share)
 		return;
 
+	llc_weight = per_cpu(sd_llc_size, env->dst_cpu);
+
 	/*
 	 * The number of CPUs to search drops as sum_util increases, when
 	 * sum_util hits 85% or above, the scan stops.
@@ -10122,6 +10130,7 @@ static void update_idle_cpu_scan(struct lb_env *env,
 
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
+	struct sched_domain_shared *sd_share = get_llc_shared(env);
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
@@ -10190,7 +10199,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		trace_sched_overutilized_tp(rd, SG_OVERUTILIZED);
 	}
 
-	update_idle_cpu_scan(env, sum_util);
+	update_idle_cpu_scan(env, sum_util, sd_share);
 }
 
 /**
-- 
2.25.1
From: Chen Yu
To: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli
Cc: Tim Chen, Mel Gorman, Dietmar Eggemann, K Prateek Nayak, Abel Wu,
    Gautham R. Shenoy, Len Brown, Yicong Yang, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 2/4] sched/topology: Introduce nr_groups in sched_domain to indicate the number of groups
Date: Tue, 13 Jun 2023 00:18:42 +0800
Message-Id: <4f7926d0d392ae88ae57815cca6a0369c8cf7cb8.1686554037.git.yu.c.chen@intel.com>

Record the number of sched groups within each sched domain, to prepare
for the newidle_balance() scan depth calculation.

Signed-off-by: Chen Yu
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/topology.c        | 10 ++++++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 816df6cc444e..1faececd5694 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -152,6 +152,7 @@ struct sched_domain {
 	struct sched_domain_shared *shared;
 
 	unsigned int span_weight;
+	unsigned int nr_groups;
 	/*
 	 * Span of all CPUs in this domain.
 	 *
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ca4472281c28..255606e88956 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1023,7 +1023,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 	struct cpumask *covered = sched_domains_tmpmask;
 	struct sd_data *sdd = sd->private;
 	struct sched_domain *sibling;
-	int i;
+	int i, nr_groups = 0;
 
 	cpumask_clear(covered);
 
@@ -1087,6 +1087,8 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		if (!sg)
 			goto fail;
 
+		nr_groups++;
+
 		sg_span = sched_group_span(sg);
 		cpumask_or(covered, covered, sg_span);
 
@@ -1100,6 +1102,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		last->next = first;
 	}
 	sd->groups = first;
+	sd->nr_groups = nr_groups;
 
 	return 0;
 
@@ -1233,7 +1236,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 	struct sd_data *sdd = sd->private;
 	const struct cpumask *span = sched_domain_span(sd);
 	struct cpumask *covered;
-	int i;
+	int i, nr_groups = 0;
 
 	lockdep_assert_held(&sched_domains_mutex);
 	covered = sched_domains_tmpmask;
@@ -1248,6 +1251,8 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 
 		sg = get_group(i, sdd);
 
+		nr_groups++;
+
 		cpumask_or(covered, covered, sched_group_span(sg));
 
 		if (!first)
@@ -1258,6 +1263,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 	}
 	last->next = first;
 	sd->groups = first;
+	sd->nr_groups = nr_groups;
 
 	return 0;
 }
-- 
2.25.1
From: Chen Yu
To: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli
Cc: Tim Chen, Mel Gorman, Dietmar Eggemann, K Prateek Nayak, Abel Wu,
    Gautham R. Shenoy, Len Brown, Yicong Yang, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization
Date: Tue, 13 Jun 2023 00:18:57 +0800

When a CPU is about to enter idle, it invokes newidle_balance() to pull
some tasks from other runqueues. Although the per-domain
max_newidle_lb_cost already throttles newidle_balance(), it would be
good to further limit the scan based on overall system utilization. The
reason is that nothing prevents newidle_balance() from being launched
on many CPUs simultaneously; since each instance has to traverse all
the CPUs to calculate the per-group statistics one by one, the total
time spent in newidle_balance() can grow as O(n^2). This is bad for
both performance and power saving.

For example, sqlite spends quite some time in newidle_balance() on an
Intel Sapphire Rapids system with 2 x 56C/112T = 224 CPUs:

   6.69%  0.09%  sqlite3  [kernel.kallsyms]  [k] newidle_balance
   5.39%  4.71%  sqlite3  [kernel.kallsyms]  [k] update_sd_lb_stats

Based on this observation, limit the scan depth of newidle_balance()
according to the utilization of the LLC domain. Let the number of
scanned groups be a linear function of the utilization ratio:

   nr_groups_to_scan = nr_groups * (1 - util_ratio)

Besides, save the total_load and total_capacity of the current sched
domain in each periodic load balance. These statistics can be reused
later by a CPU_NEWLY_IDLE load balance that quits the scan early.

Introduce a sched feature ILB_UTIL to control this.
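To illustrate the fixed-point arithmetic with made-up numbers (a
standalone sketch, not kernel code: nr_groups, span_weight and sum_util
are assumed values for one LLC of the machine above, assuming the MC
domain has one group per SMT pair; the kernel derives the real values
from the lb_env):

	#include <stdio.h>

	#define SCHED_CAPACITY_SCALE	1024UL

	int main(void)
	{
		/* one 56C/112T LLC: 56 SMT groups, 112 CPUs, ~75% busy */
		unsigned long nr_groups = 56, span_weight = 112;
		unsigned long sum_util = 112 * 768;	/* 768/1024 = 75% */
		unsigned long nr_scan;

		/* nr_scan = nr_groups * (1 - util_ratio), in fixed point */
		nr_scan = nr_groups - nr_groups * sum_util /
				      (span_weight * SCHED_CAPACITY_SCALE);

		/* prints "scan 14 of 56 groups" */
		printf("scan %lu of %lu groups\n", nr_scan, nr_groups);
		return 0;
	}

So at 75% utilization a newly idle CPU would visit only the first 14
sched groups before falling back to the snapshot saved by the last
periodic balance.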
Suggested-by: Tim Chen
Signed-off-by: Chen Yu
---
 include/linux/sched/topology.h |  4 ++++
 kernel/sched/fair.c            | 34 ++++++++++++++++++++++++++++++++++
 kernel/sched/features.h        |  1 +
 3 files changed, 39 insertions(+)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 1faececd5694..d7b2bac9bdf3 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -82,6 +82,10 @@ struct sched_domain_shared {
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
 	int		nr_idle_scan;
+	/* ilb scan depth and load balance statistic snapshot */
+	int		ilb_nr_scan;
+	unsigned long	ilb_total_load;
+	unsigned long	ilb_total_capacity;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3a24aead848..f999e838114e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10122,6 +10122,39 @@ static void update_idle_cpu_scan(struct lb_env *env,
 	WRITE_ONCE(sd_share->nr_idle_scan, (int)y);
 }
 
+static void update_ilb_group_scan(struct lb_env *env,
+				  unsigned long sum_util,
+				  struct sched_domain_shared *sd_share,
+				  struct sd_lb_stats *sds)
+{
+	u64 tmp, nr_scan;
+
+	if (!sched_feat(ILB_UTIL) || env->idle == CPU_NEWLY_IDLE)
+		return;
+
+	if (!sd_share)
+		return;
+	/*
+	 * Limit the newidle balance scan depth based on overall system
+	 * utilization:
+	 * nr_groups_scan = nr_groups * (1 - util_ratio)
+	 * and util_ratio = sum_util / (sd_weight * SCHED_CAPACITY_SCALE)
+	 */
+	nr_scan = env->sd->nr_groups * sum_util;
+	tmp = env->sd->span_weight * SCHED_CAPACITY_SCALE;
+	do_div(nr_scan, tmp);
+	nr_scan = env->sd->nr_groups - nr_scan;
+	if ((int)nr_scan != sd_share->ilb_nr_scan)
+		WRITE_ONCE(sd_share->ilb_nr_scan, (int)nr_scan);
+
+	/* Also save the statistic snapshot of the periodic load balance */
+	if (sds->total_load != sd_share->ilb_total_load)
+		WRITE_ONCE(sd_share->ilb_total_load, sds->total_load);
+
+	if (sds->total_capacity != sd_share->ilb_total_capacity)
+		WRITE_ONCE(sd_share->ilb_total_capacity, sds->total_capacity);
+}
+
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -10200,6 +10233,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}
 
 	update_idle_cpu_scan(env, sum_util, sd_share);
+	update_ilb_group_scan(env, sum_util, sd_share, sds);
 }
 
 /**
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..8f6e5b08408d 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -85,6 +85,7 @@ SCHED_FEAT(RT_PUSH_IPI, true)
 
 SCHED_FEAT(RT_RUNTIME_SHARE, false)
 SCHED_FEAT(LB_MIN, false)
+SCHED_FEAT(ILB_UTIL, true)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
 SCHED_FEAT(WA_IDLE, true)
-- 
2.25.1

From: Chen Yu
To: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli
Cc: Tim Chen, Mel Gorman, Dietmar Eggemann, K Prateek Nayak, Abel Wu,
    Gautham R. Shenoy, Len Brown, Yicong Yang, linux-kernel@vger.kernel.org,
    kernel test robot
Subject: [RFC PATCH 4/4] sched/fair: Throttle the busiest group scanning in idle load balance
Date: Tue, 13 Jun 2023 00:19:13 +0800

Scanning the whole sched domain to find the busiest group is costly
during newidle_balance(). Limit the scan depth of newidle_balance() to
a bounded number of sched groups, find a relatively busy group among
them, and pull from it.
The scan depth is suggested by the previous periodic load balance,
based on the overall utilization it measured.

Tested on top of sched/core v6.4-rc1, on a Sapphire Rapids with
2 x 56C/112T = 224 CPUs, cpufreq governor set to performance and C6
disabled. Overall, tbench/netperf show some improvement when the system
is underloaded, hackbench/schbench show no noticeable difference, and
the percentage of time spent in newidle_balance() drops accordingly.

[netperf]
Launches $nr instances of:
   netperf -4 -H 127.0.0.1 -t $work_mode -c -C -l 100 &
nr: 56, 112, 168, 224, 280, 336, 392, 448
work_mode: TCP_RR UDP_RR

throughput
==========
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	56-threads	 1.00 (  1.98)	+18.45 ( 10.84)
TCP_RR          	112-threads	 1.00 (  3.79)	 +0.92 (  4.72)
TCP_RR          	168-threads	 1.00 (  5.40)	 -0.09 (  5.94)
TCP_RR          	224-threads	 1.00 ( 42.40)	 -1.37 ( 41.42)
TCP_RR          	280-threads	 1.00 ( 15.95)	 -0.30 ( 14.82)
TCP_RR          	336-threads	 1.00 ( 27.84)	 -0.08 ( 28.91)
TCP_RR          	392-threads	 1.00 ( 41.85)	 -0.56 ( 39.18)
TCP_RR          	448-threads	 1.00 ( 45.95)	 +1.62 ( 52.54)
UDP_RR          	56-threads	 1.00 (  8.41)	 +4.86 (  5.54)
UDP_RR          	112-threads	 1.00 (  9.11)	 -0.68 (  9.92)
UDP_RR          	168-threads	 1.00 ( 10.48)	 -0.15 ( 10.07)
UDP_RR          	224-threads	 1.00 ( 40.98)	 -3.80 ( 40.01)
UDP_RR          	280-threads	 1.00 ( 23.50)	 -0.53 ( 23.42)
UDP_RR          	336-threads	 1.00 ( 35.87)	 +0.38 ( 33.43)
UDP_RR          	392-threads	 1.00 ( 49.47)	 -0.27 ( 44.40)
UDP_RR          	448-threads	 1.00 ( 53.09)	 +1.81 ( 59.98)

[tbench]
tbench -t 100 $job 127.0.0.1
job: 56, 112, 168, 224, 280, 336, 392, 448

throughput
==========
loopback        	56-threads	 1.00 (  1.12)	 +1.41 (  0.43)
loopback        	112-threads	 1.00 (  0.43)	 +0.30 (  0.73)
loopback        	168-threads	 1.00 (  6.88)	 -5.73 (  7.74)
loopback        	224-threads	 1.00 ( 12.99)	+31.32 (  0.22)
loopback        	280-threads	 1.00 (  0.38)	 -0.94 (  0.99)
loopback        	336-threads	 1.00 (  0.13)	 +0.06 (  0.18)
loopback        	392-threads	 1.00 (  0.06)	 -0.09 (  0.16)
loopback        	448-threads	 1.00 (  0.10)	 -0.13 (  0.18)

[hackbench]
hackbench -g $job --$work_type --pipe -l 200000 -s 100 -f 28
and
hackbench -g $job --$work_type -l 200000 -s 100 -f 28
job: 1, 2, 4, 8
work_type: process threads

throughput
==========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	1-groups	 1.00 (  6.09)	 +2.61 (  9.27)
process-pipe    	2-groups	 1.00 (  7.15)	 +6.22 (  5.59)
process-pipe    	4-groups	 1.00 (  3.40)	 +2.01 (  5.45)
process-pipe    	8-groups	 1.00 (  0.44)	 -1.57 (  0.70)
process-sockets 	1-groups	 1.00 (  0.69)	 +0.86 (  1.84)
process-sockets 	2-groups	 1.00 (  5.04)	 -6.31 (  0.60)
process-sockets 	4-groups	 1.00 (  0.22)	 +0.01 (  0.75)
process-sockets 	8-groups	 1.00 (  0.49)	 +0.46 (  0.49)
threads-pipe    	1-groups	 1.00 (  1.96)	 -4.86 (  6.90)
threads-pipe    	2-groups	 1.00 (  3.02)	 +0.21 (  2.72)
threads-pipe    	4-groups	 1.00 (  4.83)	 -1.08 (  6.41)
threads-pipe    	8-groups	 1.00 (  3.86)	 +4.19 (  3.82)
threads-sockets 	1-groups	 1.00 (  2.20)	 +1.65 (  1.85)
threads-sockets 	2-groups	 1.00 (  3.09)	 -0.36 (  2.14)
threads-sockets 	4-groups	 1.00 (  0.99)	 -2.54 (  1.86)
threads-sockets 	8-groups	 1.00 (  0.27)	 -0.01 (  0.79)

[schbench]
schbench -m $job -t 56 -r 30
job: 1, 2, 4, 8
3 iterations

99.0th latency
==============
case            	load    	baseline(std%)	compare%( std%)
normal          	1-mthreads	 1.00 (  1.10)	 +0.88 (  0.84)
normal          	2-mthreads	 1.00 (  0.73)	 +0.00 (  0.73)
normal          	4-mthreads	 1.00 (  1.46)	 -0.60 (  2.74)
normal          	8-mthreads	 1.00 (  4.09)	 +1.08 (  4.60)

Suggested-by: Tim Chen
Reported-by: kernel test robot
Signed-off-by: Chen Yu
---
 kernel/sched/fair.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f999e838114e..272e6c224b96 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10168,7 +10168,12 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
 	unsigned long sum_util = 0;
-	int sg_status = 0;
+	int sg_status = 0, nr_scan_ilb;
+	bool ilb_util_enabled = sched_feat(ILB_UTIL) && env->idle == CPU_NEWLY_IDLE &&
+				sd_share && READ_ONCE(sd_share->ilb_total_capacity);
+
+	if (ilb_util_enabled)
+		nr_scan_ilb = sd_share->ilb_nr_scan;
 
 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
@@ -10186,6 +10191,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 		update_sg_lb_stats(env, sds, sg, sgs, &sg_status);
 
+		if (ilb_util_enabled && --nr_scan_ilb <= 0) {
+			/* borrow the statistic of previous periodic load balance */
+			sds->total_load = READ_ONCE(sd_share->ilb_total_load);
+			sds->total_capacity = READ_ONCE(sd_share->ilb_total_capacity);
+
+			break;
+		}
+
 		if (local_group)
 			goto next_group;
 
-- 
2.25.1
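A note for testers (not part of the changelog): since ILB_UTIL is
defined via SCHED_FEAT(), a kernel built with CONFIG_SCHED_DEBUG should
allow flipping it at run time for A/B comparison through the usual
debugfs interface:

	echo NO_ILB_UTIL > /sys/kernel/debug/sched/features	# disable
	echo ILB_UTIL    > /sys/kernel/debug/sched/features	# re-enable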