From: Tim Chen
To: Peter Zijlstra
Subject: [Patch v3 1/6] sched/fair: Determine active load balance for SMT sched groups
Date: Fri, 7 Jul 2023 15:57:00 -0700

From: Tim C Chen

On hybrid CPUs with scheduling cluster enabled, we will need to
consider balancing between the SMT CPU clusters and the Atom core
clusters. Below is such a hybrid x86 CPU with 4 big cores and 8 atom
cores. Each scheduling cluster spans an L2 cache.

	--L2--    --L2--    --L2--    --L2--    ----L2----    -----L2------
	[0, 1]    [2, 3]    [4, 5]    [6, 7]    [8 9 10 11]   [12 13 14 15]
	Big       Big       Big       Big       Atom          Atom
	core      core      core      core      Module        Module

If the busiest group is a big core with both SMT CPUs busy, we should
do active load balance if the destination group has idle CPU cores.
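As a minimal illustration of that rule (a user-space sketch only, not
kernel code; the struct fields and the example topology are made up),
an SMT group whose siblings are all busy is a candidate for active
balance toward a group that still has a fully idle core:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-group snapshot; field names are illustrative only. */
struct group_snapshot {
	bool smt;		/* group CPUs share capacity (SMT siblings) */
	int nr_running;		/* tasks currently running in the group */
	int idle_cores;		/* cores whose CPUs are all idle */
};

/*
 * The rule from the paragraph above: pull from a fully busy SMT group
 * when the destination group still has a whole idle core to offer.
 */
static bool want_active_balance(const struct group_snapshot *busiest,
				const struct group_snapshot *dst)
{
	return busiest->smt && busiest->nr_running > 1 && dst->idle_cores > 0;
}

int main(void)
{
	struct group_snapshot big  = { .smt = true,  .nr_running = 2, .idle_cores = 0 };
	struct group_snapshot atom = { .smt = false, .nr_running = 1, .idle_cores = 3 };

	printf("active balance big -> atom: %s\n",
	       want_active_balance(&big, &atom) ? "yes" : "no");	/* yes */
	return 0;
}

The kernel changes below add this consideration on top of the existing
group classification.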
Such a condition is considered by asym_active_balance() in load
balancing, but it is not considered when looking for the busiest group
and computing the load imbalance. Add this consideration in
find_busiest_group() and calculate_imbalance().

In addition, update the logic determining the busier group when one
group is SMT and the other group is non-SMT, but both groups are
partially busy with idle CPUs. The busier group should be the group
with idle cores rather than the group with one busy SMT CPU. We do not
want to make the SMT group the busiest one, pull the only task off the
SMT CPU, and cause the whole core to go empty.

Otherwise, suppose that in the search for the busiest group we first
encounter an SMT group with 1 task and set it as the busiest. The
destination group is an atom cluster with 1 task, and we next
encounter an atom cluster group with 3 tasks; we will not pick this
atom cluster over the SMT group, even though we should. As a result,
we do not load balance the busier atom cluster (with 3 tasks) towards
the local atom cluster (with 1 task). And it does not make sense to
pick the 1-task SMT group as the busier group, as we also should not
pull the task off the SMT core towards the 1-task atom cluster and
make the SMT core completely empty.

Signed-off-by: Tim Chen
---
 kernel/sched/fair.c | 80 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 77 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 87317634fab2..f636d6c09dc6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8279,6 +8279,11 @@ enum group_type {
 	 * more powerful CPU.
 	 */
 	group_misfit_task,
+	/*
+	 * Balance SMT group that's fully busy. Can benefit from migrating
+	 * a task on SMT with busy sibling to another CPU on idle core.
+	 */
+	group_smt_balance,
 	/*
 	 * SD_ASYM_PACKING only: One local CPU with higher capacity is available,
 	 * and the task should be migrated to it instead of running on the
@@ -8987,6 +8992,7 @@ struct sg_lb_stats {
 	unsigned int group_weight;
 	enum group_type group_type;
 	unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
+	unsigned int group_smt_balance;  /* Task on busy SMT should be moved */
 	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
@@ -9260,6 +9266,9 @@ group_type group_classify(unsigned int imbalance_pct,
 	if (sgs->group_asym_packing)
 		return group_asym_packing;
 
+	if (sgs->group_smt_balance)
+		return group_smt_balance;
+
 	if (sgs->group_misfit_task_load)
 		return group_misfit_task;
 
@@ -9333,6 +9342,36 @@ sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs
 	return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
 }
 
+/* One group has more than one SMT CPU while the other group does not */
+static inline bool smt_vs_nonsmt_groups(struct sched_group *sg1,
+				    struct sched_group *sg2)
+{
+	if (!sg1 || !sg2)
+		return false;
+
+	return (sg1->flags & SD_SHARE_CPUCAPACITY) !=
+		(sg2->flags & SD_SHARE_CPUCAPACITY);
+}
+
+static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
+			       struct sched_group *group)
+{
+	if (env->idle == CPU_NOT_IDLE)
+		return false;
+
+	/*
+	 * For SMT source group, it is better to move a task
+	 * to a CPU that doesn't have multiple tasks sharing its CPU capacity.
+	 * Note that if a group has a single SMT, SD_SHARE_CPUCAPACITY
+	 * will not be on.
+	 */
+	if (group->flags & SD_SHARE_CPUCAPACITY &&
+	    sgs->sum_h_nr_running > 1)
+		return true;
+
+	return false;
+}
+
 static inline bool
 sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 {
@@ -9425,6 +9464,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->group_asym_packing = 1;
 	}
 
+	/* Check for loaded SMT group to be balanced to dst CPU */
+	if (!local_group && smt_balance(env, sgs, group))
+		sgs->group_smt_balance = 1;
+
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
 	/* Computing avg_load makes sense only when group is overloaded */
@@ -9509,6 +9552,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 			return false;
 		break;
 
+	case group_smt_balance:
 	case group_fully_busy:
 		/*
 		 * Select the fully busy group with highest avg_load. In
@@ -9537,6 +9581,18 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 		break;
 
 	case group_has_spare:
+		/*
+		 * Do not pick sg with SMT CPUs over sg with pure CPUs,
+		 * as we do not want to pull task off SMT core with one task
+		 * and make the core idle.
+		 */
+		if (smt_vs_nonsmt_groups(sds->busiest, sg)) {
+			if (sg->flags & SD_SHARE_CPUCAPACITY && sgs->sum_h_nr_running <= 1)
+				return false;
+			else
+				return true;
+		}
+
 		/*
 		 * Select not overloaded group with lowest number of idle cpus
 		 * and highest number of running tasks. We could also compare
@@ -9733,6 +9789,7 @@ static bool update_pick_idlest(struct sched_group *idlest,
 
 	case group_imbalanced:
 	case group_asym_packing:
+	case group_smt_balance:
 		/* Those types are not used in the slow wakeup path */
 		return false;
 
@@ -9864,6 +9921,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 
 	case group_imbalanced:
 	case group_asym_packing:
+	case group_smt_balance:
 		/* Those type are not used in the slow wakeup path */
 		return NULL;
 
@@ -10118,6 +10176,13 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		return;
 	}
 
+	if (busiest->group_type == group_smt_balance) {
+		/* Reduce number of tasks sharing CPU capacity */
+		env->migration_type = migrate_task;
+		env->imbalance = 1;
+		return;
+	}
+
 	if (busiest->group_type == group_imbalanced) {
 		/*
 		 * In the group_imb case we cannot rely on group-wide averages
@@ -10363,16 +10428,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	if (busiest->group_type != group_overloaded) {
-		if (env->idle == CPU_NOT_IDLE)
+		if (env->idle == CPU_NOT_IDLE) {
 			/*
 			 * If the busiest group is not overloaded (and as a
 			 * result the local one too) but this CPU is already
 			 * busy, let another idle CPU try to pull task.
 			 */
 			goto out_balanced;
+		}
+
+		if (busiest->group_type == group_smt_balance &&
+		    smt_vs_nonsmt_groups(sds.local, sds.busiest)) {
+			/* Let non SMT CPU pull from SMT CPU sharing with sibling */
+			goto force_balance;
+		}
 
 		if (busiest->group_weight > 1 &&
-		    local->idle_cpus <= (busiest->idle_cpus + 1))
+		    local->idle_cpus <= (busiest->idle_cpus + 1)) {
			/*
			 * If the busiest group is not overloaded
			 * and there is no imbalance between this and busiest
@@ -10383,12 +10455,14 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
			 * there is more than 1 CPU per group.
			 */
			goto out_balanced;
+		}
 
-		if (busiest->sum_h_nr_running == 1)
+		if (busiest->sum_h_nr_running == 1) {
			/*
			 * busiest doesn't have any tasks waiting to run
			 */
			goto out_balanced;
+		}
 	}
 
 force_balance:
-- 
2.32.0

From: Tim Chen
To: Peter Zijlstra
Subject: [Patch v3 2/6] sched/topology: Record number of cores in sched group
Date: Fri, 7 Jul 2023 15:57:01 -0700

From: Tim C Chen

When balancing sibling domains that have different numbers of cores,
the tasks in each sibling domain should be proportional to the number
of cores in that domain. In preparation for implementing such a
policy, record the number of cores in a scheduling group.
Signed-off-by: Tim Chen
Reviewed-by: Valentin Schneider
---
 kernel/sched/sched.h    |  1 +
 kernel/sched/topology.c | 10 +++++++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3d0eb36350d2..5f7f36e45b87 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1860,6 +1860,7 @@ struct sched_group {
 	atomic_t		ref;
 
 	unsigned int		group_weight;
+	unsigned int		cores;
 	struct sched_group_capacity *sgc;
 	int			asym_prefer_cpu;	/* CPU of highest priority in group */
 	int			flags;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 6d5628fcebcf..6b099dbdfb39 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1275,14 +1275,22 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
 {
 	struct sched_group *sg = sd->groups;
+	struct cpumask *mask = sched_domains_tmpmask2;
 
 	WARN_ON(!sg);
 
 	do {
-		int cpu, max_cpu = -1;
+		int cpu, cores = 0, max_cpu = -1;
 
 		sg->group_weight = cpumask_weight(sched_group_span(sg));
 
+		cpumask_copy(mask, sched_group_span(sg));
+		for_each_cpu(cpu, mask) {
+			cores++;
+			cpumask_andnot(mask, mask, cpu_smt_mask(cpu));
+		}
+		sg->cores = cores;
+
 		if (!(sd->flags & SD_ASYM_PACKING))
			goto next;
 
-- 
2.32.0
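The core counting loop above can be pictured with plain bitmasks.
Below is a minimal user-space sketch of the same idea, with a
hypothetical smt_sibling_mask() standing in for cpu_smt_mask()
(illustration only, not kernel code):

#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical sibling lookup: CPUs 0-7 pair up as {0,1}{2,3}{4,5}{6,7},
 * CPUs 8-15 have no SMT sibling (mirrors the hybrid example earlier).
 */
static uint64_t smt_sibling_mask(int cpu)
{
	if (cpu < 8)
		return 3ULL << (cpu & ~1);	/* the core's two SMT CPUs */
	return 1ULL << cpu;			/* non-SMT core: just itself */
}

/*
 * Count cores in a group the same way the patch does: walk the span,
 * count one core per step, then strip that core's siblings from the mask.
 * __builtin_ctzll() is the GCC/Clang "lowest set bit" builtin.
 */
static int count_cores(uint64_t span)
{
	int cores = 0;

	while (span) {
		int cpu = __builtin_ctzll(span);

		cores++;
		span &= ~smt_sibling_mask(cpu);
	}
	return cores;
}

int main(void)
{
	printf("%d\n", count_cores(0xffULL));	/* CPUs 0-7  -> 4 cores */
	printf("%d\n", count_cores(0xff00ULL));	/* CPUs 8-15 -> 8 cores */
	return 0;
}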
From: Tim Chen
To: Peter Zijlstra
Subject: [Patch v3 3/6] sched/fair: Implement prefer sibling imbalance calculation between asymmetric groups
Date: Fri, 7 Jul 2023 15:57:02 -0700

From: Tim C Chen

In the current prefer-sibling load balancing code, there is an
implicit assumption that the busiest sched group and the local sched
group are equivalent, hence the number of tasks to be moved is simply
the difference in the number of tasks between the two groups (i.e. the
imbalance) divided by two.

However, we may have a different number of cores between the cluster
groups, say when we take CPUs offline or when we have hybrid groups.
In that case, we should balance between the two groups such that the
#tasks/#cores ratio is the same in both groups. Hence the imbalance
computed will need to reflect this.

Adjust the sibling imbalance computation to take the above
considerations into account.

Signed-off-by: Tim Chen
---
 kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 37 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f636d6c09dc6..f491b94908bf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9372,6 +9372,41 @@ static inline bool smt_balance(struct lb_env *env, struct sg_lb_stats *sgs,
 	return false;
 }
 
+static inline long sibling_imbalance(struct lb_env *env,
+				    struct sd_lb_stats *sds,
+				    struct sg_lb_stats *busiest,
+				    struct sg_lb_stats *local)
+{
+	int ncores_busiest, ncores_local;
+	long imbalance;
+
+	if (env->idle == CPU_NOT_IDLE || !busiest->sum_nr_running)
+		return 0;
+
+	ncores_busiest = sds->busiest->cores;
+	ncores_local = sds->local->cores;
+
+	if (ncores_busiest == ncores_local) {
+		imbalance = busiest->sum_nr_running;
+		lsub_positive(&imbalance, local->sum_nr_running);
+		return imbalance;
+	}
+
+	/* Balance such that nr_running/ncores ratio are same on both groups */
+	imbalance = ncores_local * busiest->sum_nr_running;
+	lsub_positive(&imbalance, ncores_busiest * local->sum_nr_running);
+	/* Normalize imbalance and do rounding on normalization */
+	imbalance = 2 * imbalance + ncores_local + ncores_busiest;
+	imbalance /= ncores_local + ncores_busiest;
+
+	/* Take advantage of resource in an empty sched group */
+	if (imbalance == 0 && local->sum_nr_running == 0 &&
+	    busiest->sum_nr_running > 1)
+		imbalance = 2;
+
+	return imbalance;
+}
+
 static inline bool
 sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 {
@@ -10230,14 +10265,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	}
 
 	if (busiest->group_weight == 1 || sds->prefer_sibling) {
-		unsigned int nr_diff = busiest->sum_nr_running;
		/*
		 * When prefer sibling, evenly spread running tasks on
		 * groups.
		 */
 		env->migration_type = migrate_task;
-		lsub_positive(&nr_diff, local->sum_nr_running);
-		env->imbalance = nr_diff;
+		env->imbalance = sibling_imbalance(env, sds, busiest, local);
 	} else {
 
 		/*
@@ -10424,7 +10457,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 * group's child domain.
 	 */
 	if (sds.prefer_sibling && local->group_type == group_has_spare &&
-	    busiest->sum_nr_running > local->sum_nr_running + 1)
+	    sibling_imbalance(env, &sds, busiest, local) > 1)
 		goto force_balance;
 
 	if (busiest->group_type != group_overloaded) {
-- 
2.32.0
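For intuition, here is a minimal user-space sketch of the arithmetic
in sibling_imbalance() above, traced with made-up group sizes. The
kernel plumbing (lb_env, sd_lb_stats, lsub_positive(), the
CPU_NOT_IDLE early return) is replaced by plain arguments, and the
returned value plays the role the old nr_diff (busiest minus local
task count) played in calculate_imbalance():

#include <stdio.h>

/* Same arithmetic as sibling_imbalance(), with plain arguments. */
static long sketch_sibling_imbalance(int cores_busiest, long tasks_busiest,
				     int cores_local, long tasks_local)
{
	long imbalance;

	if (!tasks_busiest)
		return 0;

	if (cores_busiest == cores_local)
		return tasks_busiest > tasks_local ? tasks_busiest - tasks_local : 0;

	/* Balance so that the tasks/cores ratio matches on both sides. */
	imbalance = cores_local * tasks_busiest - cores_busiest * tasks_local;
	if (imbalance < 0)
		imbalance = 0;
	/* Normalize by total cores, rounding the way the patch does. */
	imbalance = (2 * imbalance + cores_local + cores_busiest) /
		    (cores_local + cores_busiest);

	/* Use an empty local group even when the ratios look balanced. */
	if (imbalance == 0 && tasks_local == 0 && tasks_busiest > 1)
		imbalance = 2;

	return imbalance;
}

int main(void)
{
	/* Two 4-core groups, 6 vs 2 tasks: plain difference, 4. */
	printf("%ld\n", sketch_sibling_imbalance(4, 6, 4, 2));
	/* 4-core atom module with 8 tasks vs 1-core big-core cluster with 1 task. */
	printf("%ld\n", sketch_sibling_imbalance(4, 8, 1, 1));
	/* Same groups with the big-core cluster completely idle. */
	printf("%ld\n", sketch_sibling_imbalance(4, 8, 1, 0));
	return 0;
}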
From: Tim Chen
To: Peter Zijlstra
Subject: [Patch v3 4/6] sched/fair: Consider the idle state of the whole core for load balance
Date: Fri, 7 Jul 2023 15:57:03 -0700

From: Ricardo Neri

should_we_balance() traverses the group_balance_mask (AND'ed with
lb_env::cpus) starting from lower-numbered CPUs looking for the first
idle CPU.

In hybrid x86 systems, the siblings of SMT cores get CPU numbers
before non-SMT cores:

	[0, 1]  [2, 3]  [4, 5]  6   7   8   9
	 b  i    b  i    b  i   b   i   i   i

In the figure above, CPUs in brackets are siblings of an SMT core. The
rest are non-SMT cores. 'b' indicates a busy CPU, 'i' indicates an
idle CPU.

We should let a CPU on a fully idle core get the first chance to do
idle load balancing, as it has more CPU capacity than an idle SMT CPU
whose sibling is busy. So for the figure above, if we are running
should_we_balance() on CPU 1, we should return false to let CPU 7,
which is on an idle core, have the first chance to do idle load
balancing.

A partially busy (i.e., of type group_has_spare) local group with SMT
cores will often have only one SMT sibling busy. If the destination
CPU is a non-SMT core, partially busy, lower-numbered SMT cores should
not be considered when finding the first idle CPU.

However, in should_we_balance(), when we encounter an idle SMT CPU
first in a partially busy core, we prematurely break the search for
the first idle CPU. Higher-numbered, non-SMT cores are not given the
chance to have idle balancing done on their behalf. Those CPUs will
only be considered for idle balancing by chance via CPU_NEWLY_IDLE.

Instead, consider the idle state of the whole SMT core.

Signed-off-by: Ricardo Neri
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
---
 kernel/sched/fair.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f491b94908bf..294a662c9410 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10729,7 +10729,7 @@ static int active_load_balance_cpu_stop(void *data);
 static int should_we_balance(struct lb_env *env)
 {
 	struct sched_group *sg = env->sd->groups;
-	int cpu;
+	int cpu, idle_smt = -1;
 
 	/*
 	 * Ensure the balancing environment is consistent; can happen
@@ -10756,10 +10756,24 @@ static int should_we_balance(struct lb_env *env)
 		if (!idle_cpu(cpu))
 			continue;
 
+		/*
+		 * Don't balance to idle SMT in busy core right away when
+		 * balancing cores, but remember the first idle SMT CPU for
+		 * later consideration. Find CPU on an idle core first.
+		 */
+		if (!(env->sd->flags & SD_SHARE_CPUCAPACITY) && !is_core_idle(cpu)) {
+			if (idle_smt == -1)
+				idle_smt = cpu;
+			continue;
+		}
+
 		/* Are we the first idle CPU? */
 		return cpu == env->dst_cpu;
 	}
 
+	if (idle_smt == env->dst_cpu)
+		return true;
+
 	/* Are we the first CPU of this group ? */
 	return group_balance_cpu(sg) == env->dst_cpu;
 }
-- 
2.32.0

From: Tim Chen
To: Peter Zijlstra
Subject: [Patch v3 5/6] sched/x86: Add cluster topology to hybrid CPU
Date: Fri, 7 Jul 2023 15:57:04 -0700

From: Tim C Chen

Cluster topology was not enabled on hybrid x86 CPUs because load
balancing did not work properly for the cluster domain. That has been
fixed, and the cluster domain can now be enabled for hybrid CPUs.
Reviewed-by: Ricardo Neri
Signed-off-by: Tim Chen
---
 arch/x86/kernel/smpboot.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index cea297d97034..2489d767c398 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -575,6 +575,9 @@ static struct sched_domain_topology_level x86_hybrid_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
 #endif
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
 #endif
-- 
2.32.0
From: Tim Chen
To: Peter Zijlstra
Subject: [Patch v3 6/6] sched/debug: Dump domains' sched group flags
Date: Fri, 7 Jul 2023 15:57:05 -0700

From: "Peter Zijlstra (Intel)"

There has been a case where the SD_SHARE_CPUCAPACITY sched group flag
in a parent domain was not set and propagated properly when a
degenerate domain was removed.

Add a dump of a CPU's domain sched group flags to make future
debugging easier.

Usage:
cat /debug/sched/domains/cpu0/domain1/groups_flags
to dump cpu0 domain1's sched group flags.

Signed-off-by: Tim Chen
---
 kernel/sched/debug.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1637b65ba07a..55b50f940feb 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -389,6 +389,7 @@ static void register_sd(struct sched_domain *sd, struct dentry *parent)
 #undef SDM
 
 	debugfs_create_file("flags", 0444, parent, &sd->flags, &sd_flags_fops);
+	debugfs_create_file("groups_flags", 0444, parent, &sd->groups->flags, &sd_flags_fops);
 }
 
 void update_sched_domain_debugfs(void)
-- 
2.32.0