From nobody Mon Oct 6 15:01:32 2025 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.20]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 389C921D3C6 for ; Mon, 21 Jul 2025 06:06:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.20 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1753078002; cv=none; b=t+QtK3uG8Ser01yAXvP/6ijVAN/zYfyK/Y+SbdeROfR6Y58Dqojlwb8Rh7i5MGk683poMDeh3WSYc5ZqKWGuvwmziAP1Rwc6zsMyAfZoh/oFgMejBx4Fb/B3V16AhJF/Dg11lvOyyl85//pI0QJ3akhdX9wCK66M2KPVcsvtBVo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1753078002; c=relaxed/simple; bh=AYOj5ojQE2MSw2izIVt9uZK1DO01G76Lg6keWmq3XSE=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=a0BHYgYzeUMj1bGiwPsaHMRzZF6Z9KfYyKFGmnwP10nmX6O4Qb6WK4I5K5QxexQcLP3he79HlRo1AgMjKfeZ7b1binbO0z40JfV+MNsH4qU0CdqG9N/rGhoDePki9gPHACEeeqfVVPCzyU9iBo5/0IUtViTulZJEem0CDu9yEPk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com; spf=pass smtp.mailfrom=intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=fHqoNV65; arc=none smtp.client-ip=198.175.65.20 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="fHqoNV65" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1753078001; x=1784614001; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=AYOj5ojQE2MSw2izIVt9uZK1DO01G76Lg6keWmq3XSE=; b=fHqoNV65YkSEO9ar6ZGKoW6jbLa29IAc1wzxJQBCjhMSeuV70fKRiSbd GKapgXS4gEaxc7zTb5aYtttEXHwyBTTOjiG85mp0tDjaj3vFsuN3Iydm6 ZQpPngt/brEnTdJ79Cv8V55xbJ68hFZlBa3FeOz7sNofTHLO1AoclMH2H 3W/riaHAniLYd2dStRJg8jf1MVrrb0CKFqBfZ0lk/tHVnvFDPzMKSaW+2 45c+SLMtYnPl7RyJ7PME4+zQHmHp1dP4bv0zuAk1vG5BaQY4hCoQfpx19 2DLq+mZvoEJ5HkOc0oEKbzSR0VUV+/OeAqudu4RX6ElsxlrOZK9V2B4R6 Q==; X-CSE-ConnectionGUID: njyZ20Y5RVqgqYUU0U6Mew== X-CSE-MsgGUID: RifTUvRwQJGwnToTDw+mHQ== X-IronPort-AV: E=McAfee;i="6800,10657,11498"; a="54994967" X-IronPort-AV: E=Sophos;i="6.16,328,1744095600"; d="scan'208";a="54994967" Received: from fmviesa001.fm.intel.com ([10.60.135.141]) by orvoesa112.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 20 Jul 2025 23:06:41 -0700 X-CSE-ConnectionGUID: uhLgVkdvSlSeOzo+5inhsA== X-CSE-MsgGUID: uDPRVH0gR4ija2ww8f+SbA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.16,328,1744095600"; d="scan'208";a="189696143" Received: from linux-pnp-server-17.sh.intel.com ([10.239.166.49]) by fmviesa001.fm.intel.com with ESMTP; 20 Jul 2025 23:06:38 -0700 From: Pan Deng To: peterz@infradead.org, mingo@kernel.org Cc: linux-kernel@vger.kernel.org, tianyou.li@intel.com, tim.c.chen@linux.intel.com, yu.c.chen@intel.com, pan.deng@intel.com Subject: [PATCH v2 3/4] sched/rt: Split root_domain->rto_count to per-NUMA-node counters Date: Mon, 21 Jul 2025 14:10:25 +0800 Message-ID: X-Mailer: git-send-email 2.43.5 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" When running a multi-instance FFmpeg workload on HCC system, significant contention is observed on root_domain `rto_count` and `overloaded` fields. The SUT is a 2-socket machine with 240 physical cores and 480 logical CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99 with FIFO scheduling. FPS is used as score. perf c2c tool reveals: root_domain cache line 1: - `rto_count` (0x4) is frequently loaded/stored - `overloaded` (0x28) is heavily loaded - cycles per load: ~2.8K to 44K: A separate patch rearranges root_domain to place `overloaded` on a different cache line, but this alone is insufficient to resolve the contention on `rto_count`. As a complementary, this patch splits `rto_count` into per-numa-node counters to reduce the contention. With this change: - FPS improves by ~4% - Kernel cycles% drops from ~20% to ~18.6% - The cache line no longer appears in perf-c2c report - stress-ng cyclic benchmark is improved ~50.4%, command: stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo \ --timeout 30 --minimize --metrics - rt-tests/pi_stress is improved ~5.1%, command: rt-tests/pi_stress -D 30 -g $(($(nproc) / 2)) Appendix: 1. Perf c2c report of root_domain cache line 1: Reviewed-by: Chen Yu Reviewed-by: Tianyou Li ------- ------- ------ ------ ------ ------ ------------------------ Rmt Lcl Store Data Load Total Symbol Hitm% Hitm% L1 Hit% offset cycles records ------- ------- ------ ------ ------ ------ ------------------------ 231 43 48 0xff14d42c400e3800 ------- ------- ------ ------ ------ ------ ------------------------ 22.51% 18.60% 0.00% 0x4 5041 247 pull_rt_task 5.63% 2.33% 45.83% 0x4 6995 315 dequeue_pushable_task 3.90% 4.65% 54.17% 0x4 6587 370 enqueue_pushable_task 0.43% 0.00% 0.00% 0x4 17111 4 enqueue_pushable_task 0.43% 0.00% 0.00% 0x4 44062 4 dequeue_pushable_task 32.03% 27.91% 0.00% 0x28 6393 285 enqueue_task_rt 16.45% 27.91% 0.00% 0x28 5534 139 sched_balance_newidle 14.72% 18.60% 0.00% 0x28 5287 110 dequeue_task_rt 3.46% 0.00% 0.00% 0x28 2820 25 enqueue_task_fair 0.43% 0.00% 0.00% 0x28 220 3 enqueue_task_stop Signed-off-by: Pan Deng Reviewed-by: Tianyou Li Reviewed-by: Chen Yu --- V1 -> V2: Fixed non CONFIG_SMP build issue --- kernel/sched/rt.c | 56 ++++++++++++++++++++++++++++++++++++++--- kernel/sched/sched.h | 9 ++++++- kernel/sched/topology.c | 7 ++++++ 3 files changed, 68 insertions(+), 4 deletions(-) diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index e40422c37033..cbcfd3aa3439 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -337,9 +337,58 @@ static inline bool need_pull_rt_task(struct rq *rq, st= ruct task_struct *prev) return rq->online && rq->rt.highest_prio.curr > prev->prio; } =20 +int rto_counts_init(atomic_tp **rto_counts) +{ + int i; + atomic_tp *counts =3D kzalloc(nr_node_ids * sizeof(atomic_tp), GFP_KERNEL= ); + + if (!counts) + return -ENOMEM; + + for (i =3D 0; i < nr_node_ids; i++) { + counts[i] =3D kzalloc_node(sizeof(atomic_t), GFP_KERNEL, i); + + if (!counts[i]) + goto cleanup; + } + + *rto_counts =3D counts; + return 0; + +cleanup: + while (i--) + kfree(counts[i]); + + kfree(counts); + return -ENOMEM; +} + +void rto_counts_cleanup(atomic_tp *rto_counts) +{ + for (int i =3D 0; i < nr_node_ids; i++) + kfree(rto_counts[i]); + + kfree(rto_counts); +} + static inline int rt_overloaded(struct rq *rq) { - return atomic_read(&rq->rd->rto_count); + int count =3D 0; + int cur_node, nid; + + cur_node =3D numa_node_id(); + + for (int i =3D 0; i < nr_node_ids; i++) { + nid =3D (cur_node + i) % nr_node_ids; + count +=3D atomic_read(rq->rd->rto_counts[nid]); + + // The caller only checks if it is 0 + // or 1, so that return once > 1 + if (count > 1) + return count; + } + + return count; } =20 static inline void rt_set_overload(struct rq *rq) @@ -358,7 +407,7 @@ static inline void rt_set_overload(struct rq *rq) * Matched by the barrier in pull_rt_task(). */ smp_wmb(); - atomic_inc(&rq->rd->rto_count); + atomic_inc(rq->rd->rto_counts[cpu_to_node(rq->cpu)]); } =20 static inline void rt_clear_overload(struct rq *rq) @@ -367,7 +416,7 @@ static inline void rt_clear_overload(struct rq *rq) return; =20 /* the order here really doesn't matter */ - atomic_dec(&rq->rd->rto_count); + atomic_dec(rq->rd->rto_counts[cpu_to_node(rq->cpu)]); cpumask_clear_cpu(rq->cpu, rq->rd->rto_mask); } =20 @@ -443,6 +492,7 @@ static inline void dequeue_pushable_task(struct rq *rq,= struct task_struct *p) static inline void rt_queue_push_tasks(struct rq *rq) { } + #endif /* CONFIG_SMP */ =20 static void enqueue_top_rt_rq(struct rt_rq *rt_rq); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index bc67806911f2..13fc3ac3381b 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -953,6 +953,8 @@ struct perf_domain { struct rcu_head rcu; }; =20 +typedef atomic_t *atomic_tp; + /* * We add the notion of a root-domain which will be used to define per-dom= ain * variables. Each exclusive cpuset essentially defines an island domain by @@ -963,12 +965,15 @@ struct perf_domain { */ struct root_domain { atomic_t refcount; - atomic_t rto_count; struct rcu_head rcu; cpumask_var_t span; cpumask_var_t online; =20 atomic_t dlo_count; + + /* rto_count per node */ + atomic_tp *rto_counts; + struct dl_bw dl_bw; struct cpudl cpudl; =20 @@ -1030,6 +1035,8 @@ extern int sched_init_domains(const struct cpumask *c= pu_map); extern void rq_attach_root(struct rq *rq, struct root_domain *rd); extern void sched_get_rd(struct root_domain *rd); extern void sched_put_rd(struct root_domain *rd); +extern int rto_counts_init(atomic_tp **rto_counts); +extern void rto_counts_cleanup(atomic_tp *rto_counts); =20 static inline int get_rd_overloaded(struct root_domain *rd) { diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index b958fe48e020..166dc8177a44 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -457,6 +457,7 @@ static void free_rootdomain(struct rcu_head *rcu) { struct root_domain *rd =3D container_of(rcu, struct root_domain, rcu); =20 + rto_counts_cleanup(rd->rto_counts); cpupri_cleanup(&rd->cpupri); cpudl_cleanup(&rd->cpudl); free_cpumask_var(rd->dlo_mask); @@ -549,8 +550,14 @@ static int init_rootdomain(struct root_domain *rd) =20 if (cpupri_init(&rd->cpupri) !=3D 0) goto free_cpudl; + + if (rto_counts_init(&rd->rto_counts) !=3D 0) + goto free_cpupri; + return 0; =20 +free_cpupri: + cpupri_cleanup(&rd->cpupri); free_cpudl: cpudl_cleanup(&rd->cpudl); free_rto_mask: --=20 2.43.5