From nobody Tue Oct 7 19:50:27 2025
From: Pan Deng
To: peterz@infradead.org, mingo@kernel.org
Cc: linux-kernel@vger.kernel.org, tianyou.li@intel.com,
    tim.c.chen@linux.intel.com, yu.c.chen@intel.com, pan.deng@intel.com
Subject: [PATCH 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache
 line contention
Date: Mon, 7 Jul 2025 10:35:25 +0800

When running a multi-instance FFmpeg workload on an HCC system, significant
cache line contention is observed around `cpupri_vec->count` and `mask` in
struct root_domain.

The SUT is a 2-socket machine with 240 physical cores and 480 logical CPUs.
60 FFmpeg instances are launched, each pinned to 4 physical cores (8 logical
CPUs) for transcoding tasks. Sub-threads use RT priority 99 with FIFO
scheduling. FPS is used as the score.

The perf c2c tool reveals:

root_domain cache line 3:
- `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored and
  contends with other fields, since counts[0] is updated more frequently
  than the other elements: it changes whenever an RT task is enqueued onto
  an empty runqueue or dequeued from a non-overloaded runqueue
- cycles per load: ~10K to 59K

cpupri's last cache line:
- `cpupri_vec->count` and `mask` contend with each other. The transcoding
  threads use RT priority 99, so the contention shows up at the end of the
  pri_to_cpu[] array.
- cycles per load: ~1.5K to 10.5K

This change mitigates the `cpupri_vec->count`/`mask` contention by placing
each count and its mask on separate cache lines. As a result:
- FPS improves by ~11%
- Kernel cycles% drops from ~20% to ~11%
- The `count`/`mask` cache line contention is mitigated: perf c2c shows
  that `cycles per load` of root_domain cache line 3 drops from ~10K-59K
  to ~0.5K-8K, and cpupri's last cache line no longer appears in the
  report.

Note: The side effect of this change is that struct cpupri grows from 26
cache lines to 203 cache lines (each of the 101 cpupri_vec entries now
spans two 64-byte cache lines, i.e. 202 lines, plus one more for the
trailing cpu_to_pri pointer).

An alternative approach would be to separate `counts` and `masks` into two
arrays in struct cpupri (counts[] and masks[]) and add two paddings:
1. between counts[0] and counts[1], since counts[0] is updated more
   frequently than the others;
2. between the two arrays, since counts[] is read-write while masks[],
   which only stores pointers, is mostly read.

The alternative introduces 31+/21- LoC of churn and achieves almost the
same performance, while struct cpupri shrinks from 26 cache lines to 21
cache lines. (A minimal sketch of this alternative layout is shown below.)
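For illustration only, this is roughly what such an alternative could look
like. The struct and field names (cpupri_alt, count0) are invented for the
sketch and are not taken from the posted series; CPUPRI_NR_PRIORITIES is
the existing 101-entry bound and ____cacheline_aligned the usual kernel
helper:

/*
 * Sketch of the alternative discussed above, not the change posted in
 * this patch: counts[] and masks[] become two separate arrays so the
 * mostly-read mask pointers no longer share cache lines with the
 * frequently written counters.
 */
struct cpupri_alt {
        /* counts[0] equivalent: the most frequently updated counter */
        atomic_t        count0;
        /* padding 1: start the remaining counters on a new cache line
         * so the hot counter stays alone on its own line */
        atomic_t        counts[CPUPRI_NR_PRIORITIES - 1] ____cacheline_aligned;
        /* padding 2: the mask pointers are written only at init time,
         * keep them off the read-write counter lines */
        cpumask_var_t   masks[CPUPRI_NR_PRIORITIES] ____cacheline_aligned;
        int             *cpu_to_pri;
};

With 64-byte lines this works out to roughly 21 cache lines, which matches
the figure quoted above, at the cost of touching every pri_to_cpu[] user.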
Appendix:

1. Current layout of contended data structures:

struct root_domain {
        atomic_t                   refcount;             /*     0     4 */
        atomic_t                   rto_count;            /*     4     4 */
        struct callback_head       rcu __attribute__((__aligned__(8))); /*     8    16 */
        cpumask_var_t              span;                 /*    24     8 */
        cpumask_var_t              online;               /*    32     8 */
        bool                       overloaded;           /*    40     1 */
        bool                       overutilized;         /*    41     1 */

        /* XXX 6 bytes hole, try to pack */

        cpumask_var_t              dlo_mask;             /*    48     8 */
        atomic_t                   dlo_count;            /*    56     4 */

        /* XXX 4 bytes hole, try to pack */

        /* --- cacheline 1 boundary (64 bytes) --- */
        struct dl_bw               dl_bw;                /*    64    24 */
        struct cpudl               cpudl;                /*    88    24 */
        u64                        visit_gen;            /*   112     8 */
        struct irq_work            rto_push_work;        /*   120    32 */
        /* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
        raw_spinlock_t             rto_lock;             /*   152     4 */
        int                        rto_loop;             /*   156     4 */
        int                        rto_cpu;              /*   160     4 */
        atomic_t                   rto_loop_next;        /*   164     4 */
        atomic_t                   rto_loop_start;       /*   168     4 */

        /* XXX 4 bytes hole, try to pack */

        cpumask_var_t              rto_mask;             /*   176     8 */
        struct cpupri              cpupri;               /*   184  1624 */
        /* --- cacheline 28 boundary (1792 bytes) was 16 bytes ago --- */
        struct perf_domain *       pd;                   /*  1808     8 */

        /* size: 1816, cachelines: 29, members: 21 */
        /* sum members: 1802, holes: 3, sum holes: 14 */
        /* forced alignments: 1 */
        /* last cacheline: 24 bytes */
} __attribute__((__aligned__(8)));

struct cpupri {
        struct cpupri_vec          pri_to_cpu[101];      /*     0  1616 */
        /* --- cacheline 25 boundary (1600 bytes) was 16 bytes ago --- */
        int *                      cpu_to_pri;           /*  1616     8 */

        /* size: 1624, cachelines: 26, members: 2 */
        /* last cacheline: 24 bytes */
};

struct cpupri_vec {
        atomic_t                   count;                /*     0     4 */

        /* XXX 4 bytes hole, try to pack */

        cpumask_var_t              mask;                 /*     8     8 */

        /* size: 16, cachelines: 1, members: 2 */
        /* sum members: 12, holes: 1, sum holes: 4 */
        /* last cacheline: 16 bytes */
};

2. Perf c2c report of root_domain cache line 3:

  -------  -------  -------  ------  -------  -------  --------------------------
      Rmt      Lcl    Store    Data     Load    Total  Symbol
    Hitm%    Hitm%  L1 Hit%  offset   cycles  records
  -------  -------  -------  ------  -------  -------  --------------------------
      353       44       62  0xff14d42c400e3880
  -------  -------  -------  ------  -------  -------  --------------------------
    0.00%    2.27%    0.00%     0x0    21683        6  __flush_smp_call_function_
    0.00%    2.27%    0.00%     0x0    22294        5  __flush_smp_call_function_
    0.28%    0.00%    0.00%     0x0        0        2  irq_work_queue_on
    0.28%    0.00%    0.00%     0x0    27824        4  irq_work_single
    0.00%    0.00%    1.61%     0x0    28151        6  irq_work_queue_on
    0.57%    0.00%    0.00%    0x18    21822        8  native_queued_spin_lock_sl
    0.28%    2.27%    0.00%    0x18    16101       10  native_queued_spin_lock_sl
    0.57%    0.00%    0.00%    0x18    33199        5  native_queued_spin_lock_sl
    0.00%    0.00%    1.61%    0x18    10908       32  _raw_spin_lock
    0.00%    0.00%    1.61%    0x18    59770        2  _raw_spin_lock
    0.00%    0.00%    1.61%    0x18        0        1  _raw_spin_unlock
    1.42%    0.00%    0.00%    0x20    12918       20  pull_rt_task
    0.85%    0.00%   25.81%    0x24    31123      199  pull_rt_task
    0.85%    0.00%    3.23%    0x24    38218       24  pull_rt_task
    0.57%    4.55%   19.35%    0x28    30558      207  pull_rt_task
    0.28%    0.00%    0.00%    0x28    55504       10  pull_rt_task
   18.70%   18.18%    0.00%    0x30    26438      291  dequeue_pushable_task
   17.28%   22.73%    0.00%    0x30    29347      281  enqueue_pushable_task
    1.70%    2.27%    0.00%    0x30    12819       31  enqueue_pushable_task
    0.28%    0.00%    0.00%    0x30    17726       18  dequeue_pushable_task
   34.56%   29.55%    0.00%    0x38    25509      527  cpupri_find_fitness
   13.88%   11.36%   24.19%    0x38    30654      342  cpupri_set
    3.12%    2.27%    0.00%    0x38    18093       39  cpupri_set
    1.70%    0.00%    0.00%    0x38    37661       52  cpupri_find_fitness
    1.42%    2.27%   19.35%    0x38    31110      211  cpupri_set
    1.42%    0.00%    1.61%    0x38    45035       31  cpupri_set
3. Perf c2c report of cpupri's last cache line:

  -------  -------  -------  ------  -------  -------  --------------------------
      Rmt      Lcl    Store    Data     Load    Total  Symbol
    Hitm%    Hitm%  L1 Hit%  offset   cycles  records
  -------  -------  -------  ------  -------  -------  --------------------------
      149       43       41  0xff14d42c400e3ec0
  -------  -------  -------  ------  -------  -------  --------------------------
    8.72%   11.63%    0.00%     0x8     2001      165  cpupri_find_fitness
    1.34%    2.33%    0.00%    0x18     1456      151  cpupri_find_fitness
    8.72%    9.30%   58.54%    0x28     1744      263  cpupri_set
    2.01%    4.65%   41.46%    0x28     1958      301  cpupri_set
    1.34%    0.00%    0.00%    0x28    10580        6  cpupri_set
   69.80%   67.44%    0.00%    0x30     1754      347  cpupri_set
    8.05%    4.65%    0.00%    0x30     2144      256  cpupri_set

Signed-off-by: Pan Deng
Signed-off-by: Tianyou Li
Reviewed-by: Tim Chen
---
 kernel/sched/cpupri.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index d6cba0020064..245b0fa626be 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -9,7 +9,7 @@
 
 struct cpupri_vec {
        atomic_t        count;
-       cpumask_var_t   mask;
+       cpumask_var_t   mask ____cacheline_aligned;
 };
 
 struct cpupri {
-- 
2.43.5
From nobody Tue Oct 7 19:50:27 2025
From: Pan Deng
To: peterz@infradead.org, mingo@kernel.org
Cc: linux-kernel@vger.kernel.org, tianyou.li@intel.com,
    tim.c.chen@linux.intel.com, yu.c.chen@intel.com, pan.deng@intel.com
Subject: [PATCH 2/4] sched/rt: Restructure root_domain to reduce cacheline
 contention
Date: Mon, 7 Jul 2025 10:35:26 +0800
Message-ID: <55013ca7954e4e7309f56bec7f62b62561162e19.1751852370.git.pan.deng@intel.com>

When running a multi-instance FFmpeg workload on an HCC system, significant
contention is observed on root_domain cache lines 1 and 3.

The SUT is a 2-socket machine with 240 physical cores and 480 logical CPUs.
60 FFmpeg instances are launched, each pinned to 4 physical cores (8 logical
CPUs) for transcoding tasks. Sub-threads use RT priority 99 with FIFO
scheduling. FPS is used as the score.

The perf c2c tool reveals (sorted by contention severity):

root_domain cache line 3:
- `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored,
  since counts[0] is updated more frequently than the other elements: it
  changes whenever an RT task is enqueued onto an empty runqueue or
  dequeued from a non-overloaded runqueue
- `rto_mask` (0x30) is heavily loaded
- `rto_loop_next` (0x24) and `rto_loop_start` (0x28) are frequently stored
- `rto_push_work` (0x0) and `rto_lock` (0x18) are lightly accessed
- cycles per load: ~10K to 59K

root_domain cache line 1:
- `rto_count` (0x4) is frequently loaded/stored
- `overloaded` (0x28) is heavily loaded
- cycles per load: ~2.8K to 44K

This change adjusts the layout of `root_domain` to isolate the contended
fields on separate cache lines:

1. `rto_count` remains in the 1st cache line; `overloaded` and
   `overutilized` are moved to the last cache line
2. `rto_push_work` is placed in the 2nd cache line
3. `rto_loop_start`, `rto_loop_next`, and `rto_lock` remain in the 3rd
   cache line; `rto_mask` is moved next to `pd` in the penultimate cache
   line
4. `cpupri` starts at the 4th cache line so that `pri_to_cpu[0].count`
   cannot contend with the fields in cache line 3

With this change:
- FPS improves by ~5%
- Kernel cycles% drops from ~20% to ~17.7%
- root_domain cache line 3 no longer appears in the perf-c2c report
- cycles per load of root_domain cache line 1 is reduced from ~2.8K-44K
  to ~2.1K-2.7K

Given the nature of the change, to my understanding it does not introduce
any negative impact in other scenarios.

Note: This change increases the size of `root_domain` from 29 to 31 cache
lines; this is considered acceptable since `root_domain` is a single
global object. (A small standalone illustration of the member-alignment
technique used here is appended after the diff.)

Appendix:

1.
Current layout of contended data structures: struct root_domain { atomic_t refcount; /* 0 4 */ atomic_t rto_count; /* 4 4 */ struct callback_head rcu __attribute__((__aligned__(8)));/*8 16 */ cpumask_var_t span; /* 24 8 */ cpumask_var_t online; /* 32 8 */ bool overloaded; /* 40 1 */ bool overutilized; /* 41 1 */ /* XXX 6 bytes hole, try to pack */ cpumask_var_t dlo_mask; /* 48 8 */ atomic_t dlo_count; /* 56 4 */ /* XXX 4 bytes hole, try to pack */ /* --- cacheline 1 boundary (64 bytes) --- */ struct dl_bw dl_bw; /* 64 24 */ struct cpudl cpudl; /* 88 24 */ u64 visit_gen; /* 112 8 */ struct irq_work rto_push_work; /* 120 32 */ /* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */ raw_spinlock_t rto_lock; /* 152 4 */ int rto_loop; /* 156 4 */ int rto_cpu; /* 160 4 */ atomic_t rto_loop_next; /* 164 4 */ atomic_t rto_loop_start; /* 168 4 */ /* XXX 4 bytes hole, try to pack */ cpumask_var_t rto_mask; /* 176 8 */ struct cpupri cpupri; /* 184 1624 */ /* --- cacheline 28 boundary (1792 bytes) was 16 bytes ago --- */ struct perf_domain * pd; /* 1808 8 */ /* size: 1816, cachelines: 29, members: 21 */ /* sum members: 1802, holes: 3, sum holes: 14 */ /* forced alignments: 1 */ /* last cacheline: 24 bytes */ } __attribute__((__aligned__(8))); struct cpupri { struct cpupri_vec pri_to_cpu[101]; /* 0 1616 */ /* --- cacheline 25 boundary (1600 bytes) was 16 bytes ago --- */ int * cpu_to_pri; /* 1616 8 */ /* size: 1624, cachelines: 26, members: 2 */ /* last cacheline: 24 bytes */ }; struct cpupri_vec { atomic_t count; /* 0 4 */ /* XXX 4 bytes hole, try to pack */ cpumask_var_t mask; /* 8 8 */ /* size: 16, cachelines: 1, members: 2 */ /* sum members: 12, holes: 1, sum holes: 4 */ /* last cacheline: 16 bytes */ }; 2. Perf c2c report of root_domain cache line 3: Reviewed-by: Chen Yu Reviewed-by: Tianyou Li ------- ------- ------ ------ ------ ------ ------------------------ Rmt Lcl Store Data Load Total Symbol Hitm% Hitm% L1 Hit% offset cycles records ------- ------- ------ ------ ------ ------ ------------------------ 353 44 62 0xff14d42c400e3880 ------- ------- ------ ------ ------ ------ ------------------------ 0.00% 2.27% 0.00% 0x0 21683 6 __flush_smp_call_function_ 0.00% 2.27% 0.00% 0x0 22294 5 __flush_smp_call_function_ 0.28% 0.00% 0.00% 0x0 0 2 irq_work_queue_on 0.28% 0.00% 0.00% 0x0 27824 4 irq_work_single 0.00% 0.00% 1.61% 0x0 28151 6 irq_work_queue_on 0.57% 0.00% 0.00% 0x18 21822 8 native_queued_spin_lock_sl 0.28% 2.27% 0.00% 0x18 16101 10 native_queued_spin_lock_sl 0.57% 0.00% 0.00% 0x18 33199 5 native_queued_spin_lock_sl 0.00% 0.00% 1.61% 0x18 10908 32 _raw_spin_lock 0.00% 0.00% 1.61% 0x18 59770 2 _raw_spin_lock 0.00% 0.00% 1.61% 0x18 0 1 _raw_spin_unlock 1.42% 0.00% 0.00% 0x20 12918 20 pull_rt_task 0.85% 0.00% 25.81% 0x24 31123 199 pull_rt_task 0.85% 0.00% 3.23% 0x24 38218 24 pull_rt_task 0.57% 4.55% 19.35% 0x28 30558 207 pull_rt_task 0.28% 0.00% 0.00% 0x28 55504 10 pull_rt_task 18.70% 18.18% 0.00% 0x30 26438 291 dequeue_pushable_task 17.28% 22.73% 0.00% 0x30 29347 281 enqueue_pushable_task 1.70% 2.27% 0.00% 0x30 12819 31 enqueue_pushable_task 0.28% 0.00% 0.00% 0x30 17726 18 dequeue_pushable_task 34.56% 29.55% 0.00% 0x38 25509 527 cpupri_find_fitness 13.88% 11.36% 24.19% 0x38 30654 342 cpupri_set 3.12% 2.27% 0.00% 0x38 18093 39 cpupri_set 1.70% 0.00% 0.00% 0x38 37661 52 cpupri_find_fitness 1.42% 2.27% 19.35% 0x38 31110 211 cpupri_set 1.42% 0.00% 1.61% 0x38 45035 31 cpupri_set 3. 
Perf c2c report of root_domain cache line 1:

  -------  -------  -------  ------  -------  -------  --------------------------
      Rmt      Lcl    Store    Data     Load    Total  Symbol
    Hitm%    Hitm%  L1 Hit%  offset   cycles  records
  -------  -------  -------  ------  -------  -------  --------------------------
      231       43       48  0xff14d42c400e3800
  -------  -------  -------  ------  -------  -------  --------------------------
   22.51%   18.60%    0.00%     0x4     5041      247  pull_rt_task
    5.63%    2.33%   45.83%     0x4     6995      315  dequeue_pushable_task
    3.90%    4.65%   54.17%     0x4     6587      370  enqueue_pushable_task
    0.43%    0.00%    0.00%     0x4    17111        4  enqueue_pushable_task
    0.43%    0.00%    0.00%     0x4    44062        4  dequeue_pushable_task
   32.03%   27.91%    0.00%    0x28     6393      285  enqueue_task_rt
   16.45%   27.91%    0.00%    0x28     5534      139  sched_balance_newidle
   14.72%   18.60%    0.00%    0x28     5287      110  dequeue_task_rt
    3.46%    0.00%    0.00%    0x28     2820       25  enqueue_task_fair
    0.43%    0.00%    0.00%    0x28      220        3  enqueue_task_stop

Signed-off-by: Pan Deng
Reviewed-by: Tianyou Li
Reviewed-by: Chen Yu
---
 kernel/sched/sched.h | 52 +++++++++++++++++++++++---------------------
 1 file changed, 27 insertions(+), 25 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 475bb5998295..dd3c79470bfc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -968,24 +968,29 @@ struct root_domain {
        cpumask_var_t           span;
        cpumask_var_t           online;
 
+       atomic_t                dlo_count;
+       struct dl_bw            dl_bw;
+       struct cpudl            cpudl;
+
+#ifdef HAVE_RT_PUSH_IPI
        /*
-        * Indicate pullable load on at least one CPU, e.g:
-        *   - More than one runnable task
-        *   - Running task is misfit
+        * For IPI pull requests, loop across the rto_mask.
         */
-       bool                    overloaded;
-
-       /* Indicate one or more CPUs over-utilized (tipping point) */
-       bool                    overutilized;
+       struct irq_work         rto_push_work;
+       raw_spinlock_t          rto_lock;
+       /* These are only updated and read within rto_lock */
+       int                     rto_loop;
+       int                     rto_cpu;
+       /* These atomics are updated outside of a lock */
+       atomic_t                rto_loop_next;
+       atomic_t                rto_loop_start;
+#endif
 
        /*
         * The bit corresponding to a CPU gets set here if such CPU has more
         * than one runnable -deadline task (as it is below for RT tasks).
         */
        cpumask_var_t           dlo_mask;
-       atomic_t                dlo_count;
-       struct dl_bw            dl_bw;
-       struct cpudl            cpudl;
 
        /*
        * Indicate whether a root_domain's dl_bw has been checked or
@@ -995,32 +1000,29 @@ struct root_domain {
        * that u64 is 'big enough'. So that shouldn't be a concern.
        */
        u64                     visit_cookie;
+       struct cpupri           cpupri ____cacheline_aligned;
 
-#ifdef HAVE_RT_PUSH_IPI
        /*
-        * For IPI pull requests, loop across the rto_mask.
+        * NULL-terminated list of performance domains intersecting with the
+        * CPUs of the rd. Protected by RCU.
         */
-       struct irq_work         rto_push_work;
-       raw_spinlock_t          rto_lock;
-       /* These are only updated and read within rto_lock */
-       int                     rto_loop;
-       int                     rto_cpu;
-       /* These atomics are updated outside of a lock */
-       atomic_t                rto_loop_next;
-       atomic_t                rto_loop_start;
-#endif
+       struct perf_domain __rcu *pd ____cacheline_aligned;
+
        /*
        * The "RT overload" flag: it gets set if a CPU has more than
        * one runnable RT task.
        */
        cpumask_var_t           rto_mask;
-       struct cpupri           cpupri;
        /*
-        * NULL-terminated list of performance domains intersecting with the
-        * CPUs of the rd. Protected by RCU.
+        * Indicate pullable load on at least one CPU, e.g:
+        *   - More than one runnable task
+        *   - Running task is misfit
        */
-       struct perf_domain __rcu *pd;
+       bool                    overloaded ____cacheline_aligned;
+
+       /* Indicate one or more CPUs over-utilized (tipping point) */
+       bool                    overutilized;
 };
 
 extern void init_defrootdomain(void);
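As a side note on the mechanism used above: ____cacheline_aligned expands
to __attribute__((__aligned__(SMP_CACHE_BYTES))), so marking a member
starts it (and everything that follows) on a fresh cache line. The tiny
standalone userspace analogue below, which assumes a 64-byte line and is
not kernel code, shows the effect on member offsets:

/* Standalone illustration: how aligning one member isolates it from a
 * frequently written neighbour. Compile and run as a normal C program. */
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

#define CACHELINE 64    /* assumed line size */

struct demo {
        atomic_int hot_counter;                 /* written constantly */
        char       other[40];                   /* unrelated fields */
        alignas(CACHELINE) int mostly_read;     /* pushed to its own line */
};

int main(void)
{
        printf("hot_counter @ %zu, mostly_read @ %zu, sizeof = %zu\n",
               offsetof(struct demo, hot_counter),
               offsetof(struct demo, mostly_read),
               sizeof(struct demo));
        /* mostly_read lands at offset 64, so stores to hot_counter no
         * longer invalidate the line that mostly_read readers pull in. */
        return 0;
}

The same idea is what pushes `cpupri`, `pd` and `overloaded` above onto
their own cache lines.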
-- 
2.43.5

From nobody Tue Oct 7 19:50:27 2025
From: Pan Deng
To: peterz@infradead.org, mingo@kernel.org
Cc: linux-kernel@vger.kernel.org, tianyou.li@intel.com,
    tim.c.chen@linux.intel.com, yu.c.chen@intel.com, pan.deng@intel.com
Subject: [PATCH 3/4] sched/rt: Split root_domain->rto_count to
 per-NUMA-node counters
Date: Mon, 7 Jul 2025 10:35:27 +0800
Message-ID: <2c1e1dbacaddd881f3cca340ece1f9268029b620.1751852370.git.pan.deng@intel.com>

When running a multi-instance FFmpeg workload on an HCC system, significant
contention is observed on the root_domain `rto_count` and `overloaded`
fields.

The SUT is a 2-socket machine with 240 physical cores and 480 logical CPUs.
60 FFmpeg instances are launched, each pinned to 4 physical cores (8 logical
CPUs) for transcoding tasks. Sub-threads use RT priority 99 with FIFO
scheduling. FPS is used as the score.

The perf c2c tool reveals:

root_domain cache line 1:
- `rto_count` (0x4) is frequently loaded/stored
- `overloaded` (0x28) is heavily loaded
- cycles per load: ~2.8K to 44K

A separate patch rearranges root_domain to place `overloaded` on a
different cache line, but that alone is insufficient to resolve the
contention on `rto_count`. As a complement, this patch splits `rto_count`
into per-NUMA-node counters to reduce the contention.

With this change:
- FPS improves by ~4%
- Kernel cycles% drops from ~20% to ~18.6%
- The cache line no longer appears in the perf-c2c report

Appendix:

1. Perf c2c report of root_domain cache line 1:

  -------  -------  -------  ------  -------  -------  --------------------------
      Rmt      Lcl    Store    Data     Load    Total  Symbol
    Hitm%    Hitm%  L1 Hit%  offset   cycles  records
  -------  -------  -------  ------  -------  -------  --------------------------
      231       43       48  0xff14d42c400e3800
  -------  -------  -------  ------  -------  -------  --------------------------
   22.51%   18.60%    0.00%     0x4     5041      247  pull_rt_task
    5.63%    2.33%   45.83%     0x4     6995      315  dequeue_pushable_task
    3.90%    4.65%   54.17%     0x4     6587      370  enqueue_pushable_task
    0.43%    0.00%    0.00%     0x4    17111        4  enqueue_pushable_task
    0.43%    0.00%    0.00%     0x4    44062        4  dequeue_pushable_task
   32.03%   27.91%    0.00%    0x28     6393      285  enqueue_task_rt
   16.45%   27.91%    0.00%    0x28     5534      139  sched_balance_newidle
   14.72%   18.60%    0.00%    0x28     5287      110  dequeue_task_rt
    3.46%    0.00%    0.00%    0x28     2820       25  enqueue_task_fair
    0.43%    0.00%    0.00%    0x28      220        3  enqueue_task_stop

Signed-off-by: Pan Deng
Reviewed-by: Tianyou Li
Reviewed-by: Chen Yu
---
 kernel/sched/rt.c       | 65 +++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h    |  9 +++++-
 kernel/sched/topology.c |  7 +++++
 3 files changed, 77 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e40422c37033..cc820dbde6d6 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -337,9 +337,58 @@ static inline bool need_pull_rt_task(struct rq *rq, struct task_struct *prev)
        return rq->online && rq->rt.highest_prio.curr > prev->prio;
 }
 
+int rto_counts_init(atomic_tp **rto_counts)
+{
+       int i;
+       atomic_tp *counts = kzalloc(nr_node_ids * sizeof(atomic_tp), GFP_KERNEL);
+
+       if (!counts)
+               return -ENOMEM;
+
+       for (i = 0; i < nr_node_ids; i++) {
+               counts[i] = kzalloc_node(sizeof(atomic_t), GFP_KERNEL, i);
+
+               if (!counts[i])
+                       goto cleanup;
+       }
+
+       *rto_counts = counts;
+       return 0;
+
+cleanup:
+       while (i--)
+               kfree(counts[i]);
+
+       kfree(counts);
+       return -ENOMEM;
+}
+
+void rto_counts_cleanup(atomic_tp *rto_counts)
+{
+       for (int i = 0; i < nr_node_ids; i++)
+               kfree(rto_counts[i]);
+
+       kfree(rto_counts);
+}
+
 static inline int rt_overloaded(struct rq *rq)
 {
-       return
atomic_read(&rq->rd->rto_count); + int count =3D 0; + int cur_node, nid; + + cur_node =3D numa_node_id(); + + for (int i =3D 0; i < nr_node_ids; i++) { + nid =3D (cur_node + i) % nr_node_ids; + count +=3D atomic_read(rq->rd->rto_counts[nid]); + + // The caller only checks if it is 0 + // or 1, so that return once > 1 + if (count > 1) + return count; + } + + return count; } =20 static inline void rt_set_overload(struct rq *rq) @@ -358,7 +407,7 @@ static inline void rt_set_overload(struct rq *rq) * Matched by the barrier in pull_rt_task(). */ smp_wmb(); - atomic_inc(&rq->rd->rto_count); + atomic_inc(rq->rd->rto_counts[cpu_to_node(rq->cpu)]); } =20 static inline void rt_clear_overload(struct rq *rq) @@ -367,7 +416,7 @@ static inline void rt_clear_overload(struct rq *rq) return; =20 /* the order here really doesn't matter */ - atomic_dec(&rq->rd->rto_count); + atomic_dec(rq->rd->rto_counts[cpu_to_node(rq->cpu)]); cpumask_clear_cpu(rq->cpu, rq->rd->rto_mask); } =20 @@ -443,6 +492,16 @@ static inline void dequeue_pushable_task(struct rq *rq= , struct task_struct *p) static inline void rt_queue_push_tasks(struct rq *rq) { } + +int rto_counts_init(atomic_tp **rto_counts) +{ + return 0; +} + +void rto_counts_cleanup(atomic_tp *rto_counts) +{ +} + #endif /* CONFIG_SMP */ =20 static void enqueue_top_rt_rq(struct rt_rq *rt_rq); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index dd3c79470bfc..f80968724dd6 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -953,6 +953,8 @@ struct perf_domain { struct rcu_head rcu; }; =20 +typedef atomic_t *atomic_tp; + /* * We add the notion of a root-domain which will be used to define per-dom= ain * variables. Each exclusive cpuset essentially defines an island domain by @@ -963,12 +965,15 @@ struct perf_domain { */ struct root_domain { atomic_t refcount; - atomic_t rto_count; struct rcu_head rcu; cpumask_var_t span; cpumask_var_t online; =20 atomic_t dlo_count; + + /* rto_count per node */ + atomic_tp *rto_counts; + struct dl_bw dl_bw; struct cpudl cpudl; =20 @@ -1030,6 +1035,8 @@ extern int sched_init_domains(const struct cpumask *c= pu_map); extern void rq_attach_root(struct rq *rq, struct root_domain *rd); extern void sched_get_rd(struct root_domain *rd); extern void sched_put_rd(struct root_domain *rd); +extern int rto_counts_init(atomic_tp **rto_counts); +extern void rto_counts_cleanup(atomic_tp *rto_counts); =20 static inline int get_rd_overloaded(struct root_domain *rd) { diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index b958fe48e020..166dc8177a44 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -457,6 +457,7 @@ static void free_rootdomain(struct rcu_head *rcu) { struct root_domain *rd =3D container_of(rcu, struct root_domain, rcu); =20 + rto_counts_cleanup(rd->rto_counts); cpupri_cleanup(&rd->cpupri); cpudl_cleanup(&rd->cpudl); free_cpumask_var(rd->dlo_mask); @@ -549,8 +550,14 @@ static int init_rootdomain(struct root_domain *rd) =20 if (cpupri_init(&rd->cpupri) !=3D 0) goto free_cpudl; + + if (rto_counts_init(&rd->rto_counts) !=3D 0) + goto free_cpupri; + return 0; =20 +free_cpupri: + cpupri_cleanup(&rd->cpupri); free_cpudl: cpudl_cleanup(&rd->cpudl); free_rto_mask: --=20 2.43.5 From nobody Tue Oct 7 19:50:27 2025 Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.9]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 42FCA1C5F13 for ; Mon, 7 Jul 2025 02:31:18 +0000 (UTC) 
From: Pan Deng
To: peterz@infradead.org, mingo@kernel.org
Cc: linux-kernel@vger.kernel.org, tianyou.li@intel.com,
    tim.c.chen@linux.intel.com, yu.c.chen@intel.com, pan.deng@intel.com
Subject: [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node
 to reduce contention
Date: Mon, 7 Jul 2025 10:35:28 +0800
Message-ID: <3755b9d2bf78da2ae593a9c92d8e79dddcaa9877.1751852370.git.pan.deng@intel.com>

When running a multi-instance FFmpeg workload on an HCC system, significant
contention is observed on the bitmap of `cpupri_vec->cpumask`. (An
illustrative sketch of the per-node mask encoding is appended after the
diff.)

The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs.
60 FFmpeg instances are launched, each pinned to 4 physical cores (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99 with FIFO scheduling. FPS is used as score. perf c2c tool reveals: cpumask (bitmap) cache line of `cpupri_vec->mask`: - bits are loaded during cpupri_find - bits are stored during cpupri_set - cycles per load: ~2.2K to 8.7K This change splits `cpupri_vec->cpumask` into per-NUMA-node data to mitigate false sharing. As a result: - FPS improves by ~3.8% - Kernel cycles% drops from ~20% to ~18.7% - Cache line contention is mitigated, perf-c2c shows cycles per load drops from ~2.2K-8.7K to ~0.5K-2.2K Note: CONFIG_CPUMASK_OFFSTACK=3Dn remains unchanged. Appendix: 1. Perf c2c report of `cpupri_vec->mask` bitmap cache line: Reviewed-by: Chen Yu Reviewed-by: Tianyou Li ------- ------- ------ ------ ------ ------ ------------------------ Rmt Lcl Store Data Load Total Symbol Hitm% Hitm% L1 Hit% offset cycles records ------- ------- ------ ------ ------ ------ ------------------------ 155 39 39 0xff14d52c4682d800 ------- ------- ------ ------ ------ ------ ------------------------ 43.23% 43.59% 0.00% 0x0 3489 415 _find_first_and_bit 3.23% 5.13% 0.00% 0x0 3478 107 __bitmap_and 3.23% 0.00% 0.00% 0x0 2712 33 _find_first_and_bit 1.94% 0.00% 7.69% 0x0 5992 33 cpupri_set 0.00% 0.00% 5.13% 0x0 3733 19 cpupri_set 12.90% 12.82% 0.00% 0x8 3452 297 _find_first_and_bit 1.29% 2.56% 0.00% 0x8 3007 117 __bitmap_and 0.00% 5.13% 0.00% 0x8 3041 20 _find_first_and_bit 0.00% 2.56% 2.56% 0x8 2374 22 cpupri_set 0.00% 0.00% 7.69% 0x8 4194 38 cpupri_set 8.39% 2.56% 0.00% 0x10 3336 264 _find_first_and_bit 3.23% 0.00% 0.00% 0x10 3023 46 _find_first_and_bit 2.58% 0.00% 0.00% 0x10 3040 130 __bitmap_and 1.29% 0.00% 12.82% 0x10 4075 34 cpupri_set 0.00% 0.00% 2.56% 0x10 2197 19 cpupri_set 0.00% 2.56% 7.69% 0x18 4085 27 cpupri_set 0.00% 2.56% 0.00% 0x18 3128 220 _find_first_and_bit 0.00% 0.00% 5.13% 0x18 3028 20 cpupri_set 2.58% 2.56% 0.00% 0x20 3089 198 _find_first_and_bit 1.29% 0.00% 5.13% 0x20 5114 29 cpupri_set 0.65% 2.56% 0.00% 0x20 3224 96 __bitmap_and 0.65% 0.00% 7.69% 0x20 4392 31 cpupri_set 2.58% 0.00% 0.00% 0x28 3327 214 _find_first_and_bit 0.65% 2.56% 5.13% 0x28 5252 31 cpupri_set 0.65% 0.00% 7.69% 0x28 8755 25 cpupri_set 0.65% 0.00% 0.00% 0x28 4414 14 _find_first_and_bit 1.29% 2.56% 0.00% 0x30 3139 171 _find_first_and_bit 0.65% 0.00% 7.69% 0x30 2185 18 cpupri_set 0.65% 0.00% 0.00% 0x30 3404 108 __bitmap_and 0.00% 0.00% 2.56% 0x30 5542 21 cpupri_set 3.23% 5.13% 0.00% 0x38 3493 190 _find_first_and_bit 3.23% 2.56% 0.00% 0x38 3171 108 __bitmap_and 0.00% 2.56% 7.69% 0x38 3285 14 cpupri_set 0.00% 0.00% 5.13% 0x38 4035 27 cpupri_set Signed-off-by: Pan Deng Reviewed-by: Tianyou Li Reviewed-by: Chen Yu --- kernel/sched/cpupri.c | 200 ++++++++++++++++++++++++++++++++++++++---- kernel/sched/cpupri.h | 4 + 2 files changed, 186 insertions(+), 18 deletions(-) diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c index 42c40cfdf836..306b6baff4cd 100644 --- a/kernel/sched/cpupri.c +++ b/kernel/sched/cpupri.c @@ -64,6 +64,143 @@ static int convert_prio(int prio) return cpupri; } =20 +#ifdef CONFIG_CPUMASK_OFFSTACK +static inline int alloc_vec_masks(struct cpupri_vec *vec) +{ + int i; + + for (i =3D 0; i < nr_node_ids; i++) { + if (!zalloc_cpumask_var_node(&vec->masks[i], GFP_KERNEL, i)) + goto cleanup; + + // Clear masks of cur node, set others + bitmap_complement(cpumask_bits(vec->masks[i]), + cpumask_bits(cpumask_of_node(i)), small_cpumask_bits); + } + return 0; + +cleanup: + while (i--) + 
free_cpumask_var(vec->masks[i]); + return -ENOMEM; +} + +static inline void free_vec_masks(struct cpupri_vec *vec) +{ + for (int i =3D 0; i < nr_node_ids; i++) + free_cpumask_var(vec->masks[i]); +} + +static inline int setup_vec_mask_var_ts(struct cpupri *cp) +{ + int i; + + for (i =3D 0; i < CPUPRI_NR_PRIORITIES; i++) { + struct cpupri_vec *vec =3D &cp->pri_to_cpu[i]; + + vec->masks =3D kcalloc(nr_node_ids, sizeof(cpumask_var_t), GFP_KERNEL); + if (!vec->masks) + goto cleanup; + } + return 0; + +cleanup: + /* Free any already allocated masks */ + while (i--) { + kfree(cp->pri_to_cpu[i].masks); + cp->pri_to_cpu[i].masks =3D NULL; + } + + return -ENOMEM; +} + +static inline void free_vec_mask_var_ts(struct cpupri *cp) +{ + for (int i =3D 0; i < CPUPRI_NR_PRIORITIES; i++) { + kfree(cp->pri_to_cpu[i].masks); + cp->pri_to_cpu[i].masks =3D NULL; + } +} + +static inline int +available_cpu_in_nodes(struct task_struct *p, struct cpupri_vec *vec) +{ + int cur_node =3D numa_node_id(); + + for (int i =3D 0; i < nr_node_ids; i++) { + int nid =3D (cur_node + i) % nr_node_ids; + + if (cpumask_first_and_and(&p->cpus_mask, vec->masks[nid], + cpumask_of_node(nid)) < nr_cpu_ids) + return 1; + } + + return 0; +} + +#define available_cpu_in_vec available_cpu_in_nodes + +#else /* !CONFIG_CPUMASK_OFFSTACK */ + +static inline int alloc_vec_masks(struct cpupri_vec *vec) +{ + if (!zalloc_cpumask_var(&vec->mask, GFP_KERNEL)) + return -ENOMEM; + + return 0; +} + +static inline void free_vec_masks(struct cpupri_vec *vec) +{ + free_cpumask_var(vec->mask); +} + +static inline int setup_vec_mask_var_ts(struct cpupri *cp) +{ + return 0; +} + +static inline void free_vec_mask_var_ts(struct cpupri *cp) +{ +} + +static inline int +available_cpu_in_vec(struct task_struct *p, struct cpupri_vec *vec) +{ + if (cpumask_any_and(&p->cpus_mask, vec->mask) >=3D nr_cpu_ids) + return 0; + + return 1; +} +#endif + +static inline int alloc_all_masks(struct cpupri *cp) +{ + int i; + + for (i =3D 0; i < CPUPRI_NR_PRIORITIES; i++) { + if (alloc_vec_masks(&cp->pri_to_cpu[i])) + goto cleanup; + } + + return 0; + +cleanup: + while (i--) + free_vec_masks(&cp->pri_to_cpu[i]); + + return -ENOMEM; +} + +static inline void setup_vec_counts(struct cpupri *cp) +{ + for (int i =3D 0; i < CPUPRI_NR_PRIORITIES; i++) { + struct cpupri_vec *vec =3D &cp->pri_to_cpu[i]; + + atomic_set(&vec->count, 0); + } +} + static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p, struct cpumask *lowest_mask, int idx) { @@ -96,11 +233,24 @@ static inline int __cpupri_find(struct cpupri *cp, str= uct task_struct *p, if (skip) return 0; =20 - if (cpumask_any_and(&p->cpus_mask, vec->mask) >=3D nr_cpu_ids) + if (!available_cpu_in_vec(p, vec)) return 0; =20 +#ifdef CONFIG_CPUMASK_OFFSTACK + struct cpumask *cpupri_mask =3D lowest_mask; + + // available && lowest_mask + if (lowest_mask) { + cpumask_copy(cpupri_mask, vec->masks[0]); + for (int nid =3D 1; nid < nr_node_ids; nid++) + cpumask_and(cpupri_mask, cpupri_mask, vec->masks[nid]); + } +#else + struct cpumask *cpupri_mask =3D vec->mask; +#endif + if (lowest_mask) { - cpumask_and(lowest_mask, &p->cpus_mask, vec->mask); + cpumask_and(lowest_mask, &p->cpus_mask, cpupri_mask); cpumask_and(lowest_mask, lowest_mask, cpu_active_mask); =20 /* @@ -229,7 +379,11 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri) if (likely(newpri !=3D CPUPRI_INVALID)) { struct cpupri_vec *vec =3D &cp->pri_to_cpu[newpri]; =20 +#ifdef CONFIG_CPUMASK_OFFSTACK + cpumask_set_cpu(cpu, vec->masks[cpu_to_node(cpu)]); +#else 
cpumask_set_cpu(cpu, vec->mask); +#endif /* * When adding a new vector, we update the mask first, * do a write memory barrier, and then update the count, to @@ -263,7 +417,11 @@ void cpupri_set(struct cpupri *cp, int cpu, int newpri) */ atomic_dec(&(vec)->count); smp_mb__after_atomic(); +#ifdef CONFIG_CPUMASK_OFFSTACK + cpumask_clear_cpu(cpu, vec->masks[cpu_to_node(cpu)]); +#else cpumask_clear_cpu(cpu, vec->mask); +#endif } =20 *currpri =3D newpri; @@ -279,26 +437,31 @@ int cpupri_init(struct cpupri *cp) { int i; =20 - for (i =3D 0; i < CPUPRI_NR_PRIORITIES; i++) { - struct cpupri_vec *vec =3D &cp->pri_to_cpu[i]; - - atomic_set(&vec->count, 0); - if (!zalloc_cpumask_var(&vec->mask, GFP_KERNEL)) - goto cleanup; - } - + /* Allocate the cpu_to_pri array */ cp->cpu_to_pri =3D kcalloc(nr_cpu_ids, sizeof(int), GFP_KERNEL); if (!cp->cpu_to_pri) - goto cleanup; + return -ENOMEM; =20 + /* Initialize all CPUs to invalid priority */ for_each_possible_cpu(i) cp->cpu_to_pri[i] =3D CPUPRI_INVALID; =20 + /* Setup priority vectors */ + setup_vec_counts(cp); + if (setup_vec_mask_var_ts(cp)) + goto fail_setup_vectors; + + /* Allocate masks for each priority vector */ + if (alloc_all_masks(cp)) + goto fail_alloc_masks; + return 0; =20 -cleanup: - for (i--; i >=3D 0; i--) - free_cpumask_var(cp->pri_to_cpu[i].mask); +fail_alloc_masks: + free_vec_mask_var_ts(cp); + +fail_setup_vectors: + kfree(cp->cpu_to_pri); return -ENOMEM; } =20 @@ -308,9 +471,10 @@ int cpupri_init(struct cpupri *cp) */ void cpupri_cleanup(struct cpupri *cp) { - int i; - kfree(cp->cpu_to_pri); - for (i =3D 0; i < CPUPRI_NR_PRIORITIES; i++) - free_cpumask_var(cp->pri_to_cpu[i].mask); + + for (int i =3D 0; i < CPUPRI_NR_PRIORITIES; i++) + free_vec_masks(&cp->pri_to_cpu[i]); + + free_vec_mask_var_ts(cp); } diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h index 245b0fa626be..c53f1f4dad86 100644 --- a/kernel/sched/cpupri.h +++ b/kernel/sched/cpupri.h @@ -9,7 +9,11 @@ =20 struct cpupri_vec { atomic_t count; +#ifdef CONFIG_CPUMASK_OFFSTACK + cpumask_var_t *masks ____cacheline_aligned; +#else cpumask_var_t mask ____cacheline_aligned; +#endif }; =20 struct cpupri { --=20 2.43.5
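For readers following the CONFIG_CPUMASK_OFFSTACK path in patch 4: each
node's mask is initialized to the complement of that node's CPUs, a CPU's
bit is only ever set/cleared in its own node's mask, and the effective
mask is the AND across all node masks. A standalone toy model (8 CPUs,
2 nodes, one byte per cpumask; all names here are invented for the
illustration, this is not kernel code):

/* Toy model of the per-node mask encoding used above: 8 CPUs, 2 NUMA
 * nodes, one byte per "cpumask". Illustrative only. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define NR_NODES 2
static const uint8_t node_cpus[NR_NODES] = { 0x0F, 0xF0 }; /* CPUs 0-3, 4-7 */

static uint8_t masks[NR_NODES];

static void init_masks(void)
{
        /* like bitmap_complement(masks[i], cpumask_of_node(i), ...):
         * a node's mask has all *other* CPUs pre-set, its own cleared */
        for (int i = 0; i < NR_NODES; i++)
                masks[i] = (uint8_t)~node_cpus[i];
}

static void set_cpu(int cpu)   { masks[cpu / 4] |=  (uint8_t)(1u << cpu); }
static void clear_cpu(int cpu) { masks[cpu / 4] &= (uint8_t)~(1u << cpu); }

static uint8_t effective_mask(void)
{
        /* like the cpumask_and() loop in __cpupri_find(): AND across nodes */
        uint8_t m = masks[0];

        for (int i = 1; i < NR_NODES; i++)
                m &= masks[i];
        return m;
}

int main(void)
{
        init_masks();
        set_cpu(1);                     /* CPU 1, node 0 */
        set_cpu(6);                     /* CPU 6, node 1 */
        assert(effective_mask() == ((1u << 1) | (1u << 6)));
        clear_cpu(1);
        assert(effective_mask() == (1u << 6));
        printf("effective mask: 0x%02x\n", effective_mask());
        return 0;
}

Since a CPU's bit is only written in its own node's mask (the real code
indexes by cpu_to_node(cpu) in cpupri_set()), writers on different nodes
dirty different cache lines, while the AND across nodes still reconstructs
exactly the set of CPUs recorded at that priority.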