From: Mathieu Desnoyers
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, Ingo Molnar,
    Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Vincent Guittot, Juri Lelli,
    Swapnil Sapkal, Aaron Lu, Julien Desfossez, x86@kernel.org
Subject: [RFC PATCH v3 1/3] sched: Rename cpus_share_cache to cpus_share_llc
Date: Tue, 22 Aug 2023 07:31:31 -0400
Message-Id: <20230822113133.643238-2-mathieu.desnoyers@efficios.com>
In-Reply-To: <20230822113133.643238-1-mathieu.desnoyers@efficios.com>
References: <20230822113133.643238-1-mathieu.desnoyers@efficios.com>

In preparation for introducing cpus_share_l2c, rename cpus_share_cache
to cpus_share_llc, to make it clear that it specifically groups CPUs by
LLC.

Signed-off-by: Mathieu Desnoyers
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Vincent Guittot
Cc: Juri Lelli
Cc: Swapnil Sapkal
Cc: Aaron Lu
Cc: Julien Desfossez
Cc: x86@kernel.org
Tested-by: Swapnil Sapkal
---
 block/blk-mq.c                 | 2 +-
 include/linux/sched/topology.h | 4 ++--
 kernel/sched/core.c            | 4 ++--
 kernel/sched/fair.c            | 8 ++++----
 kernel/sched/topology.c        | 2 +-
 5 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b9f454613989..ed1457ca2c6d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1136,7 +1136,7 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq)
 	/* same CPU or cache domain?  Complete locally */
 	if (cpu == rq->mq_ctx->cpu ||
 	    (!test_bit(QUEUE_FLAG_SAME_FORCE, &rq->q->queue_flags) &&
-	     cpus_share_cache(cpu, rq->mq_ctx->cpu)))
+	     cpus_share_llc(cpu, rq->mq_ctx->cpu)))
 		return false;
 
 	/* don't try to IPI to an offline CPU */
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 816df6cc444e..7f9331f71260 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -178,7 +178,7 @@ extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
 void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
 
-bool cpus_share_cache(int this_cpu, int that_cpu);
+bool cpus_share_llc(int this_cpu, int that_cpu);
 
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 typedef int (*sched_domain_flags_f)(void);
@@ -227,7 +227,7 @@ partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 {
 }
 
-static inline bool cpus_share_cache(int this_cpu, int that_cpu)
+static inline bool cpus_share_llc(int this_cpu, int that_cpu)
 {
 	return true;
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a68d1276bab0..d096ce815099 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3904,7 +3904,7 @@ void wake_up_if_idle(int cpu)
 	rcu_read_unlock();
 }
 
-bool cpus_share_cache(int this_cpu, int that_cpu)
+bool cpus_share_llc(int this_cpu, int that_cpu)
 {
 	if (this_cpu == that_cpu)
 		return true;
@@ -3929,7 +3929,7 @@ static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
 	 * If the CPU does not share cache, then queue the task on the
 	 * remote rqs wakelist to avoid accessing remote data.
 	 */
-	if (!cpus_share_cache(smp_processor_id(), cpu))
+	if (!cpus_share_llc(smp_processor_id(), cpu))
 		return true;
 
 	if (cpu == smp_processor_id())
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4da5f3541762..680bbe0c7d7a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6626,7 +6626,7 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
 	 * a cpufreq perspective, it's better to have higher utilisation
 	 * on one CPU.
 	 */
-	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
+	if (available_idle_cpu(this_cpu) && cpus_share_llc(this_cpu, prev_cpu))
 		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
 
 	if (sync && cpu_rq(this_cpu)->nr_running == 1)
@@ -7146,7 +7146,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
-	if (prev != target && cpus_share_cache(prev, target) &&
+	if (prev != target && cpus_share_llc(prev, target) &&
 	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
 	    asym_fits_cpu(task_util, util_min, util_max, prev))
 		return prev;
@@ -7172,7 +7172,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 		p->recent_used_cpu = prev;
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
-	    cpus_share_cache(recent_used_cpu, target) &&
+	    cpus_share_llc(recent_used_cpu, target) &&
 	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
 	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
 	    asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
@@ -7206,7 +7206,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (sched_smt_active()) {
 		has_idle_core = test_idle_cores(target);
 
-		if (!has_idle_core && cpus_share_cache(prev, target)) {
+		if (!has_idle_core && cpus_share_llc(prev, target)) {
 			i = select_idle_smt(p, prev);
 			if ((unsigned int)i < nr_cpumask_bits)
 				return i;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 6682535e37c8..1ae2a0a1115a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -661,7 +661,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
  *
  * Also keep a unique ID per domain (we use the first CPU number in
  * the cpumask of the domain), this allows us to quickly tell if
- * two CPUs are in the same cache domain, see cpus_share_cache().
+ * two CPUs are in the same cache domain, see cpus_share_llc().
  */
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
-- 
2.39.2

From: Mathieu Desnoyers
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, Ingo Molnar,
    Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Vincent Guittot, Juri Lelli,
    Swapnil Sapkal, Aaron Lu, Julien Desfossez, x86@kernel.org
Subject: [RFC PATCH v3 2/3] sched: Introduce cpus_share_l2c
Date: Tue, 22 Aug 2023 07:31:32 -0400
Message-Id: <20230822113133.643238-3-mathieu.desnoyers@efficios.com>
In-Reply-To: <20230822113133.643238-1-mathieu.desnoyers@efficios.com>
References: <20230822113133.643238-1-mathieu.desnoyers@efficios.com>

Introduce cpus_share_l2c to allow querying whether two logical CPUs
share a common L2 cache.

On a system such as the AMD EPYC 9654 96-Core Processor, the L1 cache
has a latency of 4-5 cycles and the L2 cache has a latency of at least
14 ns, whereas the L3 cache has a latency of 50 ns [1]. By comparison,
I measured RAM accesses at a latency of about 120 ns on my system [2],
so the L3 is only about 2.4x faster than RAM.

Given how slow the L3 is relative to the L2, the scheduler benefits
from only considering CPUs sharing an L2 cache when deciding whether to
use remote runqueue locking rather than queued wakeups.
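
To illustrate the grouping introduced here, a minimal stand-alone
sketch (user-space C, not kernel code; the 8-CPU topology with four
2-thread clusters is purely hypothetical): each CPU is tagged with the
first CPU number of its L2 (cluster) mask, and two CPUs share an L2 if
and only if their tags match.

/*
 * User-space model of cpus_share_l2c() for illustration only.
 * The per-CPU tag mirrors sd_l2c_id: the first CPU of the L2 cluster.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 8

/* Hypothetical topology: four clusters of two hardware threads each. */
static const int l2c_id[NR_CPUS] = { 0, 0, 2, 2, 4, 4, 6, 6 };

static bool cpus_share_l2c(int this_cpu, int that_cpu)
{
	if (this_cpu == that_cpu)
		return true;
	return l2c_id[this_cpu] == l2c_id[that_cpu];
}

int main(void)
{
	printf("CPUs 0 and 1 share L2: %d\n", cpus_share_l2c(0, 1)); /* 1 */
	printf("CPUs 1 and 2 share L2: %d\n", cpus_share_l2c(1, 2)); /* 0 */
	return 0;
}

The kernel-side implementation in the diff below derives that tag from
topology_cluster_cpumask() when cluster topology is available, and
falls back to the LLC id otherwise.
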
Link: https://en.wikichip.org/wiki/amd/microarchitectures/zen_4 [1]
Link: https://github.com/ChipsandCheese/MemoryLatencyTest [2]
Signed-off-by: Mathieu Desnoyers
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Vincent Guittot
Cc: Juri Lelli
Cc: Swapnil Sapkal
Cc: Aaron Lu
Cc: Julien Desfossez
Cc: x86@kernel.org
Tested-by: Swapnil Sapkal
---
Changes since v1:
- Fix the l2c id for configurations where the L2 has a single logical
  CPU: use TOPOLOGY_CLUSTER_SYSFS to find out whether cluster topology
  is implemented, or whether the LLC should be used as a fallback.

Changes since v2:
- Reverse the order of the cpu_get_l2c_info() l2c_id and l2c_size
  output arguments to match the caller.
---
 include/linux/sched/topology.h |  6 ++++++
 kernel/sched/core.c            |  8 ++++++++
 kernel/sched/sched.h           |  2 ++
 kernel/sched/topology.c        | 32 +++++++++++++++++++++++++++++---
 4 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 7f9331f71260..c5fdee188bea 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -178,6 +178,7 @@ extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
 void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
 
+bool cpus_share_l2c(int this_cpu, int that_cpu);
 bool cpus_share_llc(int this_cpu, int that_cpu);
 
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
@@ -227,6 +228,11 @@ partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 {
 }
 
+static inline bool cpus_share_l2c(int this_cpu, int that_cpu)
+{
+	return true;
+}
+
 static inline bool cpus_share_llc(int this_cpu, int that_cpu)
 {
 	return true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d096ce815099..11e60a69ae31 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3904,6 +3904,14 @@ void wake_up_if_idle(int cpu)
 	rcu_read_unlock();
 }
 
+bool cpus_share_l2c(int this_cpu, int that_cpu)
+{
+	if (this_cpu == that_cpu)
+		return true;
+
+	return per_cpu(sd_l2c_id, this_cpu) == per_cpu(sd_l2c_id, that_cpu);
+}
+
 bool cpus_share_llc(int this_cpu, int that_cpu)
 {
 	if (this_cpu == that_cpu)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 81ac605b9cd5..d93543db214c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1828,6 +1828,8 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
 	return sd;
 }
 
+DECLARE_PER_CPU(int, sd_l2c_size);
+DECLARE_PER_CPU(int, sd_l2c_id);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 1ae2a0a1115a..fadb66edcf5e 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -661,8 +661,11 @@ static void destroy_sched_domains(struct sched_domain *sd)
  *
  * Also keep a unique ID per domain (we use the first CPU number in
  * the cpumask of the domain), this allows us to quickly tell if
- * two CPUs are in the same cache domain, see cpus_share_llc().
+ * two CPUs are in the same cache domain, see cpus_share_l2c() and
+ * cpus_share_llc().
  */
+DEFINE_PER_CPU(int, sd_l2c_size);
+DEFINE_PER_CPU(int, sd_l2c_id);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
@@ -672,12 +675,27 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
 DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
 
+#ifdef TOPOLOGY_CLUSTER_SYSFS
+static int cpu_get_l2c_info(int cpu, int *l2c_size, int *l2c_id)
+{
+	const struct cpumask *cluster_mask = topology_cluster_cpumask(cpu);
+
+	*l2c_size = cpumask_weight(cluster_mask);
+	*l2c_id = cpumask_first(cluster_mask);
+	return 0;
+}
+#else
+static int cpu_get_l2c_info(int cpu, int *l2c_size, int *l2c_id)
+{
+	return -1;
+}
+#endif
+
 static void update_top_cache_domain(int cpu)
 {
 	struct sched_domain_shared *sds = NULL;
 	struct sched_domain *sd;
-	int id = cpu;
-	int size = 1;
+	int id = cpu, size = 1, l2c_id, l2c_size;
 
 	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
 	if (sd) {
@@ -686,6 +704,14 @@ static void update_top_cache_domain(int cpu)
 		sds = sd->shared;
 	}
 
+	if (cpu_get_l2c_info(cpu, &l2c_size, &l2c_id)) {
+		/* Fallback on using LLC. */
+		l2c_size = size;
+		l2c_id = id;
+	}
+	per_cpu(sd_l2c_size, cpu) = l2c_size;
+	per_cpu(sd_l2c_id, cpu) = l2c_id;
+
 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_size, cpu) = size;
 	per_cpu(sd_llc_id, cpu) = id;
-- 
2.39.2

From: Mathieu Desnoyers
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, Ingo Molnar,
    Valentin Schneider, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Vincent Guittot, Juri Lelli,
    Swapnil Sapkal, Aaron Lu, Julien Desfossez, x86@kernel.org
Subject: [RFC PATCH v3 3/3] sched: ttwu_queue_cond: skip queued wakeups across different l2 caches
Date: Tue, 22 Aug 2023 07:31:33 -0400
Message-Id: <20230822113133.643238-4-mathieu.desnoyers@efficios.com>
In-Reply-To: <20230822113133.643238-1-mathieu.desnoyers@efficios.com>
References: <20230822113133.643238-1-mathieu.desnoyers@efficios.com>

On a system such as the AMD EPYC 9654 96-Core Processor, the L1 cache
has a latency of 4-5 cycles and the L2 cache has a latency of at least
14 ns, whereas the L3 cache has a latency of 50 ns [1]. By comparison,
I measured RAM accesses at a latency of about 120 ns on my system [2],
so the L3 is only about 2.4x faster than RAM.

Given how slow the L3 is relative to the L2, the scheduler benefits
from only considering CPUs sharing an L2 cache when deciding whether to
use remote runqueue locking rather than queued wakeups.

Skipping queued wakeups for all logical CPUs sharing an LLC means that
on a 192-core AMD EPYC 9654 96-Core Processor system (two sockets),
groups of 8 cores (16 hardware threads) end up grabbing the runqueue
locks of the other runqueues within the same group on each wakeup,
causing contention on those runqueue locks.

Improve this by only considering logical CPUs sharing an L2 cache as
candidates for skipping the queued wakeup.

This results in the following benchmark improvements:

  hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

goes from 49s to 34s (30% speedup), and similarly with perf bench:

  perf bench sched messaging -g 32 -p -t -l 100000

goes from 10.9s to 7.4s (32% speedup).

I have noticed that in order to observe the speedup, the workload needs
to keep the CPUs sufficiently busy to cause runqueue lock contention,
but not so busy that they never go idle. This can be explained by the
fact that idle CPUs are a preferred target for task wakeup runqueue
selection, so having idle CPUs causes more migrations, which triggers
more remote wakeups. For both the hackbench and the perf bench sched
messaging benchmarks, the scale of the workload can be tweaked by
changing the number of groups.

This was developed as part of the investigation into a weird regression
reported by AMD where adding a raw spinlock in the scheduler context
switch accelerated hackbench. It turned out that replacing this raw
spinlock with a loop of 10000x cpu_relax within do_idle() had similar
benefits. This patch achieves a similar effect without busy waiting and
without changing anything about runqueue selection on wakeup: only
hardware threads sharing an L2 cache may skip the queued try-to-wakeup
and directly grab the target runqueue lock, rather than all hardware
threads sharing an LLC.

I would be interested in feedback about the performance impact of this
patch (improvement or regression) on other workloads and hardware,
especially for Intel CPUs. One thing we might want to figure out
empirically from the topology is whether there is a maximum number of
hardware threads within an LLC below which it would still make sense to
use the LLC, rather than the L2, as the group within which queued
wakeups can be skipped.
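
When reporting results, the sketch below prints the L2-cluster and LLC
sibling lists for the first few CPUs. It assumes the kernel exposes
cluster topology in sysfs and that cache index3 is the LLC, which holds
on typical x86 systems:

/* Print L2-cluster and LLC sibling lists for a few CPUs (illustration). */
#include <stdio.h>

static void print_group(const char *label, const char *path_fmt, int cpu)
{
	char path[128], buf[256];
	FILE *f;

	snprintf(path, sizeof(path), path_fmt, cpu);
	f = fopen(path, "r");
	if (f && fgets(buf, sizeof(buf), f))
		printf("cpu%d %-11s %s", cpu, label, buf); /* buf keeps its '\n' */
	else
		printf("cpu%d %-11s <unavailable>\n", cpu, label);
	if (f)
		fclose(f);
}

int main(void)
{
	for (int cpu = 0; cpu < 4; cpu++) { /* sample the first few CPUs */
		print_group("L2 cluster:",
			    "/sys/devices/system/cpu/cpu%d/topology/cluster_cpus_list", cpu);
		print_group("LLC:",
			    "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list", cpu);
	}
	return 0;
}

On kernels without cluster topology support the cluster file is absent,
and this series then falls back to grouping by LLC id (see patch 2/3).
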
Link: https://en.wikichip.org/wiki/amd/microarchitectures/zen_4 [1]
Link: https://github.com/ChipsandCheese/MemoryLatencyTest [2]
Link: https://lore.kernel.org/r/09e0f469-a3f7-62ef-75a1-e64cec2dcfc5@amd.com
Link: https://lore.kernel.org/lkml/20230725193048.124796-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/lkml/20230810140635.75296-1-mathieu.desnoyers@efficios.com/
Signed-off-by: Mathieu Desnoyers
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Vincent Guittot
Cc: Juri Lelli
Cc: Swapnil Sapkal
Cc: Aaron Lu
Cc: Julien Desfossez
Cc: x86@kernel.org
Tested-by: Swapnil Sapkal
---
 kernel/sched/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 11e60a69ae31..317f4cec4653 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3934,10 +3934,10 @@ static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
 		return false;
 
 	/*
-	 * If the CPU does not share cache, then queue the task on the
+	 * If the CPU does not share L2 cache, then queue the task on the
 	 * remote rqs wakelist to avoid accessing remote data.
 	 */
-	if (!cpus_share_llc(smp_processor_id(), cpu))
+	if (!cpus_share_l2c(smp_processor_id(), cpu))
 		return true;
 
 	if (cpu == smp_processor_id())
-- 
2.39.2