From: "Kenan.Liu" <kenan.liu@linux.alibaba.com>
To: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	bristot@redhat.com, vschneid@redhat.com
Cc: luoben@linux.alibaba.com, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 1/2] sched/fair: Adjust CFS load balancing for machines
 with QEMU native CPU topology
Date: Thu, 20 Jul 2023 16:34:12 +0800
Message-Id: <1689842053-5291-2-git-send-email-Kenan.Liu@linux.alibaba.com>
In-Reply-To: <1689842053-5291-1-git-send-email-Kenan.Liu@linux.alibaba.com>
References: <1689842053-5291-1-git-send-email-Kenan.Liu@linux.alibaba.com>

From: "Kenan.Liu" <kenan.liu@linux.alibaba.com>

Multithreaded workloads running in a QEMU VM may encounter an unexpected
phenomenon: one hyperthread of a physical core is busy while its sibling
is idle. The main reason is that hyperthread indexes are consecutive in
the QEMU native x86 CPU model, unlike the physical topology, where the
two siblings of a core are usually numbered nr_cores apart. With the
current kernel scheduler implementation, a hyperthread with an even ID
number is picked with much higher probability during load balancing and
task placement. To resolve the imbalance, on a multi-core machine whose
hyperthread indexes are consecutive within each core, change the result
of select_idle_core() according to the hyperthread on which the task ran
before.
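As an illustration (outside the patch itself), the sibling layout can be
checked from user space via the standard sysfs topology files. The sketch
below is a hypothetical helper, not kernel code; it assumes only the
documented "a-b"/"a,b" format of thread_siblings_list and inspects the
first two listed siblings of each CPU:

#include <stdio.h>

int main(void)
{
	char path[128], buf[64];
	int cpu, a, b;
	FILE *f;

	for (cpu = 0; ; cpu++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
			 cpu);
		f = fopen(path, "r");
		if (!f)
			break;	/* no more CPUs to inspect */
		/* "0-1" or "0,32": read the first two sibling IDs */
		if (fgets(buf, sizeof(buf), f) &&
		    sscanf(buf, "%d%*[-,]%d", &a, &b) == 2)
			printf("cpu%d siblings %d,%d -> %s\n", cpu, a, b,
			       b - a == 1 ? "consecutive" : "strided");
		fclose(f);
	}
	return 0;
}

On a guest using the QEMU native x86 CPU model this should report
"consecutive" (siblings 0-1, 2-3, ...), while typical bare-metal hosts
report "strided" (siblings such as 0,32).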
Signed-off-by: Kenan.Liu <kenan.liu@linux.alibaba.com>
Signed-off-by: Ben Luo <luoben@linux.alibaba.com>
---
 kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a739..ad7c93f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -125,6 +125,9 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL;
 
 const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
+static bool smt_neighbour_topo;
+static bool core_smt_topo_detect;
+static unsigned int smt_nr_cpu = 2;
 
 int sched_thermal_decay_shift;
 static int __init setup_sched_thermal_decay_shift(char *str)
 {
@@ -140,6 +143,26 @@ static int __init setup_sched_thermal_decay_shift(char *str)
 __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
 
 #ifdef CONFIG_SMP
+static void explore_core_smp_topology(void)
+{
+	int cpu = smp_processor_id(), sibling;
+	const struct cpumask *smt_mask = cpu_smt_mask(cpu);
+
+	if (nr_cpu_ids <= 2)
+		return;
+
+	smt_nr_cpu = cpumask_weight(smt_mask);
+	if (smt_nr_cpu < 2)
+		return;
+
+	for_each_cpu(sibling, cpu_smt_mask(cpu)) {
+		if (cpu == sibling)
+			continue;
+		if (abs(cpu - sibling) == 1)
+			smt_neighbour_topo = true;
+	}
+}
+
 /*
  * For asym packing, by default the lower numbered CPU has higher priority.
  */
@@ -6887,9 +6910,16 @@ void __update_idle_core(struct rq *rq)
 static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu)
 {
 	bool idle = true;
-	int cpu;
+	int cpu, sibling = core;
+
+	if (!core_smt_topo_detect) {
+		explore_core_smp_topology();
+		core_smt_topo_detect = true;
+	}
 
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
+		if (cpu != core)
+			sibling = cpu;
 		if (!available_idle_cpu(cpu)) {
 			idle = false;
 			if (*idle_cpu == -1) {
@@ -6905,8 +6935,12 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
 			*idle_cpu = cpu;
 	}
 
-	if (idle)
+	if (idle) {
+		if (!smt_neighbour_topo || unlikely(core % smt_nr_cpu))
+			return core;
+		core = task_cpu(p) % smt_nr_cpu ? core : sibling;
 		return core;
+	}
 
 	cpumask_andnot(cpus, cpus, cpu_smt_mask(core));
 	return -1;
-- 
1.8.3.1
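For review purposes, the decision taken by the new return path can be
modelled stand-alone. The sketch below is a hypothetical user-space
model, not kernel code: pick_core() and its parameters are illustrative
names, with task_cpu(p) represented by prev_cpu and 2-way SMT assumed.

#include <stdio.h>
#include <stdbool.h>

/*
 * Models the patched tail of select_idle_core(): on a consecutive-
 * sibling layout, an idle core is normally reported via its even
 * "leader" CPU, so alternate between the two siblings based on where
 * the task last ran instead of always returning the even hyperthread.
 */
static int pick_core(int core, int prev_cpu, int smt_nr_cpu,
		     bool neighbour_topo)
{
	int sibling = core + 1;	/* consecutive layout: sibling follows core */

	if (!neighbour_topo || core % smt_nr_cpu)
		return core;
	return prev_cpu % smt_nr_cpu ? core : sibling;
}

int main(void)
{
	/* Task last ran on even cpu 0: idle core 2 is entered via cpu 3. */
	printf("%d\n", pick_core(2, 0, 2, true));
	/* Task last ran on odd cpu 1: the even hyperthread, cpu 2, is kept. */
	printf("%d\n", pick_core(2, 1, 2, true));
	return 0;
}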