From: Andrea Righi
To: Tejun Heo, David Vernet, Changwoo Min
Cc: Yury Norov, linux-kernel@vger.kernel.org
Subject: [PATCH v3] sched_ext: Move built-in idle CPU selection policy to a separate file
Date: Mon, 27 Jan 2025 23:27:21 +0100
Message-ID: <20250127222721.627121-1-arighi@nvidia.com>
X-Mailer: git-send-email 2.48.1

As ext.c is becoming quite large, move the idle CPU selection policy to
separate files (ext_idle.c / ext_idle.h) for better code readability.

Moreover, group all the idle CPU selection kfuncs together in the same
btf_kfunc_id_set block.

No functional changes; this is purely a code reorganization.

Suggested-by: Yury Norov
Signed-off-by: Andrea Righi
---
 MAINTAINERS                 |   3 +-
 kernel/sched/build_policy.c |   1 +
 kernel/sched/ext.c          | 739 +----------------------------------
 kernel/sched/ext_idle.c     | 752 ++++++++++++++++++++++++++++++++++++
 kernel/sched/ext_idle.h     |  39 ++
 5 files changed, 808 insertions(+), 726 deletions(-)
 create mode 100644 kernel/sched/ext_idle.c
 create mode 100644 kernel/sched/ext_idle.h

ChangeLog v2 -> v3:
 - make the kfunc id sets static and introduce scx_idle_init() to register them

ChangeLog v1 -> v2:
 - move idle kfuncs to a separate btf ids set (scx_kfunc_set_idle)
 - declare idle functions as extern prototypes with an "scx_idle_" prefix

diff --git a/MAINTAINERS b/MAINTAINERS
index 4e3a9c3fcb8c..021ef04afa00 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -21006,8 +21006,7 @@ S: Maintained
 W: https://github.com/sched-ext/scx
 T: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git
 F: include/linux/sched/ext.h
-F: kernel/sched/ext.h
-F: kernel/sched/ext.c
+F: kernel/sched/ext*
 F: tools/sched_ext/
 F: tools/testing/selftests/sched_ext
 
diff --git a/kernel/sched/build_policy.c b/kernel/sched/build_policy.c
index fae1f5c921eb..72d97aa8b726 100644
--- a/kernel/sched/build_policy.c
+++ b/kernel/sched/build_policy.c
@@ -61,6 +61,7 @@
 
 #ifdef CONFIG_SCHED_CLASS_EXT
 # include "ext.c"
+# include "ext_idle.c"
 #endif
 
 #include "syscalls.c"
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 413f13c88699..6be6a8c0ce2e 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6,6 +6,9 @@
  * Copyright (c) 2022 Tejun Heo
  * Copyright (c) 2022 David Vernet
  */
+#include
+#include "ext_idle.h"
+
 #define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void)))
 
 enum scx_consts {
@@ -883,12 +886,6 @@ static bool scx_warned_zero_slice;
 static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_last);
 static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_exiting);
 static DEFINE_STATIC_KEY_FALSE(scx_ops_cpu_preempt);
-static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
-
-#ifdef CONFIG_SMP
-static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
-static DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa);
-#endif
 
 static struct static_key_false scx_has_op[SCX_OPI_END] =
 	{ [0 ... SCX_OPI_END-1] = STATIC_KEY_FALSE_INIT };
@@ -923,21 +920,6 @@ static unsigned long scx_watchdog_timestamp = INITIAL_JIFFIES;
 
 static struct delayed_work scx_watchdog_work;
 
-/* idle tracking */
-#ifdef CONFIG_SMP
-#ifdef CONFIG_CPUMASK_OFFSTACK
-#define CL_ALIGNED_IF_ONSTACK
-#else
-#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
-#endif
-
-static struct {
-	cpumask_var_t cpu;
-	cpumask_var_t smt;
-} idle_masks CL_ALIGNED_IF_ONSTACK;
-
-#endif /* CONFIG_SMP */
-
 /* for %SCX_KICK_WAIT */
 static unsigned long __percpu *scx_kick_cpus_pnt_seqs;
 
@@ -3175,416 +3157,6 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
 
 #ifdef CONFIG_SMP
 
-static bool test_and_clear_cpu_idle(int cpu)
-{
-#ifdef CONFIG_SCHED_SMT
-	/*
-	 * SMT mask should be cleared whether we can claim @cpu or not. The SMT
-	 * cluster is not wholly idle either way. This also prevents
-	 * scx_pick_idle_cpu() from getting caught in an infinite loop.
-	 */
-	if (sched_smt_active()) {
-		const struct cpumask *smt = cpu_smt_mask(cpu);
-
-		/*
-		 * If offline, @cpu is not its own sibling and
-		 * scx_pick_idle_cpu() can get caught in an infinite loop as
-		 * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
-		 * is eventually cleared.
-		 *
-		 * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to
-		 * reduce memory writes, which may help alleviate cache
-		 * coherence pressure.
-		 */
-		if (cpumask_intersects(smt, idle_masks.smt))
-			cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
-		else if (cpumask_test_cpu(cpu, idle_masks.smt))
-			__cpumask_clear_cpu(cpu, idle_masks.smt);
-	}
-#endif
-	return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
-}
-
-static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
-{
-	int cpu;
-
-retry:
-	if (sched_smt_active()) {
-		cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
-		if (cpu < nr_cpu_ids)
-			goto found;
-
-		if (flags & SCX_PICK_IDLE_CORE)
-			return -EBUSY;
-	}
-
-	cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
-	if (cpu >= nr_cpu_ids)
-		return -EBUSY;
-
-found:
-	if (test_and_clear_cpu_idle(cpu))
-		return cpu;
-	else
-		goto retry;
-}
-
-/*
- * Return the amount of CPUs in the same LLC domain of @cpu (or zero if the LLC
- * domain is not defined).
- */
-static unsigned int llc_weight(s32 cpu)
-{
-	struct sched_domain *sd;
-
-	sd = rcu_dereference(per_cpu(sd_llc, cpu));
-	if (!sd)
-		return 0;
-
-	return sd->span_weight;
-}
-
-/*
- * Return the cpumask representing the LLC domain of @cpu (or NULL if the LLC
- * domain is not defined).
- */
-static struct cpumask *llc_span(s32 cpu)
-{
-	struct sched_domain *sd;
-
-	sd = rcu_dereference(per_cpu(sd_llc, cpu));
-	if (!sd)
-		return 0;
-
-	return sched_domain_span(sd);
-}
-
-/*
- * Return the amount of CPUs in the same NUMA domain of @cpu (or zero if the
- * NUMA domain is not defined).
- */
-static unsigned int numa_weight(s32 cpu)
-{
-	struct sched_domain *sd;
-	struct sched_group *sg;
-
-	sd = rcu_dereference(per_cpu(sd_numa, cpu));
-	if (!sd)
-		return 0;
-	sg = sd->groups;
-	if (!sg)
-		return 0;
-
-	return sg->group_weight;
-}
-
-/*
- * Return the cpumask representing the NUMA domain of @cpu (or NULL if the NUMA
- * domain is not defined).
- */
-static struct cpumask *numa_span(s32 cpu)
-{
-	struct sched_domain *sd;
-	struct sched_group *sg;
-
-	sd = rcu_dereference(per_cpu(sd_numa, cpu));
-	if (!sd)
-		return NULL;
-	sg = sd->groups;
-	if (!sg)
-		return NULL;
-
-	return sched_group_span(sg);
-}
-
-/*
- * Return true if the LLC domains do not perfectly overlap with the NUMA
- * domains, false otherwise.
- */
-static bool llc_numa_mismatch(void)
-{
-	int cpu;
-
-	/*
-	 * We need to scan all online CPUs to verify whether their scheduling
-	 * domains overlap.
-	 *
-	 * While it is rare to encounter architectures with asymmetric NUMA
-	 * topologies, CPU hotplugging or virtualized environments can result
-	 * in asymmetric configurations.
-	 *
-	 * For example:
-	 *
-	 *  NUMA 0:
-	 *    - LLC 0: cpu0..cpu7
-	 *    - LLC 1: cpu8..cpu15 [offline]
-	 *
-	 *  NUMA 1:
-	 *    - LLC 0: cpu16..cpu23
-	 *    - LLC 1: cpu24..cpu31
-	 *
-	 * In this case, if we only check the first online CPU (cpu0), we might
-	 * incorrectly assume that the LLC and NUMA domains are fully
-	 * overlapping, which is incorrect (as NUMA 1 has two distinct LLC
-	 * domains).
-	 */
-	for_each_online_cpu(cpu)
-		if (llc_weight(cpu) != numa_weight(cpu))
-			return true;
-
-	return false;
-}
-
-/*
- * Initialize topology-aware scheduling.
- *
- * Detect if the system has multiple LLC or multiple NUMA domains and enable
- * cache-aware / NUMA-aware scheduling optimizations in the default CPU idle
- * selection policy.
- *
- * Assumption: the kernel's internal topology representation assumes that each
- * CPU belongs to a single LLC domain, and that each LLC domain is entirely
- * contained within a single NUMA node.
- */
-static void update_selcpu_topology(void)
-{
-	bool enable_llc = false, enable_numa = false;
-	unsigned int nr_cpus;
-	s32 cpu = cpumask_first(cpu_online_mask);
-
-	/*
-	 * Enable LLC domain optimization only when there are multiple LLC
-	 * domains among the online CPUs. If all online CPUs are part of a
-	 * single LLC domain, the idle CPU selection logic can choose any
-	 * online CPU without bias.
-	 *
-	 * Note that it is sufficient to check the LLC domain of the first
-	 * online CPU to determine whether a single LLC domain includes all
-	 * CPUs.
-	 */
-	rcu_read_lock();
-	nr_cpus = llc_weight(cpu);
-	if (nr_cpus > 0) {
-		if (nr_cpus < num_online_cpus())
-			enable_llc = true;
-		pr_debug("sched_ext: LLC=%*pb weight=%u\n",
-			 cpumask_pr_args(llc_span(cpu)), llc_weight(cpu));
-	}
-
-	/*
-	 * Enable NUMA optimization only when there are multiple NUMA domains
-	 * among the online CPUs and the NUMA domains don't perfectly overlaps
-	 * with the LLC domains.
-	 *
-	 * If all CPUs belong to the same NUMA node and the same LLC domain,
-	 * enabling both NUMA and LLC optimizations is unnecessary, as checking
-	 * for an idle CPU in the same domain twice is redundant.
-	 */
-	nr_cpus = numa_weight(cpu);
-	if (nr_cpus > 0) {
-		if (nr_cpus < num_online_cpus() && llc_numa_mismatch())
-			enable_numa = true;
-		pr_debug("sched_ext: NUMA=%*pb weight=%u\n",
-			 cpumask_pr_args(numa_span(cpu)), numa_weight(cpu));
-	}
-	rcu_read_unlock();
-
-	pr_debug("sched_ext: LLC idle selection %s\n",
-		 str_enabled_disabled(enable_llc));
-	pr_debug("sched_ext: NUMA idle selection %s\n",
-		 str_enabled_disabled(enable_numa));
-
-	if (enable_llc)
-		static_branch_enable_cpuslocked(&scx_selcpu_topo_llc);
-	else
-		static_branch_disable_cpuslocked(&scx_selcpu_topo_llc);
-	if (enable_numa)
-		static_branch_enable_cpuslocked(&scx_selcpu_topo_numa);
-	else
-		static_branch_disable_cpuslocked(&scx_selcpu_topo_numa);
-}
-
-/*
- * Built-in CPU idle selection policy:
- *
- * 1. Prioritize full-idle cores:
- *   - always prioritize CPUs from fully idle cores (both logical CPUs are
- *     idle) to avoid interference caused by SMT.
- *
- * 2. Reuse the same CPU:
- *   - prefer the last used CPU to take advantage of cached data (L1, L2) and
- *     branch prediction optimizations.
- *
- * 3. Pick a CPU within the same LLC (Last-Level Cache):
- *   - if the above conditions aren't met, pick a CPU that shares the same LLC
- *     to maintain cache locality.
- *
- * 4. Pick a CPU within the same NUMA node, if enabled:
- *   - choose a CPU from the same NUMA node to reduce memory access latency.
- *
- * 5. Pick any idle CPU usable by the task.
- *
- * Step 3 and 4 are performed only if the system has, respectively, multiple
- * LLC domains / multiple NUMA nodes (see scx_selcpu_topo_llc and
- * scx_selcpu_topo_numa).
- *
- * NOTE: tasks that can only run on 1 CPU are excluded by this logic, because
- * we never call ops.select_cpu() for them, see select_task_rq().
- */
-static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
-			      u64 wake_flags, bool *found)
-{
-	const struct cpumask *llc_cpus = NULL;
-	const struct cpumask *numa_cpus = NULL;
-	s32 cpu;
-
-	*found = false;
-
-	/*
-	 * This is necessary to protect llc_cpus.
-	 */
-	rcu_read_lock();
-
-	/*
-	 * Determine the scheduling domain only if the task is allowed to run
-	 * on all CPUs.
-	 *
-	 * This is done primarily for efficiency, as it avoids the overhead of
-	 * updating a cpumask every time we need to select an idle CPU (which
-	 * can be costly in large SMP systems), but it also aligns logically:
-	 * if a task's scheduling domain is restricted by user-space (through
-	 * CPU affinity), the task will simply use the flat scheduling domain
-	 * defined by user-space.
-	 */
-	if (p->nr_cpus_allowed >= num_possible_cpus()) {
-		if (static_branch_maybe(CONFIG_NUMA, &scx_selcpu_topo_numa))
-			numa_cpus = numa_span(prev_cpu);
-
-		if (static_branch_maybe(CONFIG_SCHED_MC, &scx_selcpu_topo_llc))
-			llc_cpus = llc_span(prev_cpu);
-	}
-
-	/*
-	 * If WAKE_SYNC, try to migrate the wakee to the waker's CPU.
-	 */
-	if (wake_flags & SCX_WAKE_SYNC) {
-		cpu = smp_processor_id();
-
-		/*
-		 * If the waker's CPU is cache affine and prev_cpu is idle,
-		 * then avoid a migration.
-		 */
-		if (cpus_share_cache(cpu, prev_cpu) &&
-		    test_and_clear_cpu_idle(prev_cpu)) {
-			cpu = prev_cpu;
-			goto cpu_found;
-		}
-
-		/*
-		 * If the waker's local DSQ is empty, and the system is under
-		 * utilized, try to wake up @p to the local DSQ of the waker.
-		 *
-		 * Checking only for an empty local DSQ is insufficient as it
-		 * could give the wakee an unfair advantage when the system is
-		 * oversaturated.
-		 *
-		 * Checking only for the presence of idle CPUs is also
-		 * insufficient as the local DSQ of the waker could have tasks
-		 * piled up on it even if there is an idle core elsewhere on
-		 * the system.
-		 */
-		if (!cpumask_empty(idle_masks.cpu) &&
-		    !(current->flags & PF_EXITING) &&
-		    cpu_rq(cpu)->scx.local_dsq.nr == 0) {
-			if (cpumask_test_cpu(cpu, p->cpus_ptr))
-				goto cpu_found;
-		}
-	}
-
-	/*
-	 * If CPU has SMT, any wholly idle CPU is likely a better pick than
-	 * partially idle @prev_cpu.
-	 */
-	if (sched_smt_active()) {
-		/*
-		 * Keep using @prev_cpu if it's part of a fully idle core.
-		 */
-		if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
-		    test_and_clear_cpu_idle(prev_cpu)) {
-			cpu = prev_cpu;
-			goto cpu_found;
-		}
-
-		/*
-		 * Search for any fully idle core in the same LLC domain.
-		 */
-		if (llc_cpus) {
-			cpu = scx_pick_idle_cpu(llc_cpus, SCX_PICK_IDLE_CORE);
-			if (cpu >= 0)
-				goto cpu_found;
-		}
-
-		/*
-		 * Search for any fully idle core in the same NUMA node.
-		 */
-		if (numa_cpus) {
-			cpu = scx_pick_idle_cpu(numa_cpus, SCX_PICK_IDLE_CORE);
-			if (cpu >= 0)
-				goto cpu_found;
-		}
-
-		/*
-		 * Search for any full idle core usable by the task.
-		 */
-		cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
-		if (cpu >= 0)
-			goto cpu_found;
-	}
-
-	/*
-	 * Use @prev_cpu if it's idle.
-	 */
-	if (test_and_clear_cpu_idle(prev_cpu)) {
-		cpu = prev_cpu;
-		goto cpu_found;
-	}
-
-	/*
-	 * Search for any idle CPU in the same LLC domain.
-	 */
-	if (llc_cpus) {
-		cpu = scx_pick_idle_cpu(llc_cpus, 0);
-		if (cpu >= 0)
-			goto cpu_found;
-	}
-
-	/*
-	 * Search for any idle CPU in the same NUMA node.
-	 */
-	if (numa_cpus) {
-		cpu = scx_pick_idle_cpu(numa_cpus, 0);
-		if (cpu >= 0)
-			goto cpu_found;
-	}
-
-	/*
-	 * Search for any idle CPU usable by the task.
-	 */
-	cpu = scx_pick_idle_cpu(p->cpus_ptr, 0);
-	if (cpu >= 0)
-		goto cpu_found;
-
-	rcu_read_unlock();
-	return prev_cpu;
-
-cpu_found:
-	rcu_read_unlock();
-
-	*found = true;
-	return cpu;
-}
-
 static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flags)
 {
 	/*
@@ -3651,90 +3223,6 @@ static void set_cpus_allowed_scx(struct task_struct *p,
 				 (struct cpumask *)p->cpus_ptr);
 }
 
-static void reset_idle_masks(void)
-{
-	/*
-	 * Consider all online cpus idle. Should converge to the actual state
-	 * quickly.
-	 */
-	cpumask_copy(idle_masks.cpu, cpu_online_mask);
-	cpumask_copy(idle_masks.smt, cpu_online_mask);
-}
-
-static void update_builtin_idle(int cpu, bool idle)
-{
-	assign_cpu(cpu, idle_masks.cpu, idle);
-
-#ifdef CONFIG_SCHED_SMT
-	if (sched_smt_active()) {
-		const struct cpumask *smt = cpu_smt_mask(cpu);
-
-		if (idle) {
-			/*
-			 * idle_masks.smt handling is racy but that's fine as
-			 * it's only for optimization and self-correcting.
-			 */
-			if (!cpumask_subset(smt, idle_masks.cpu))
-				return;
-			cpumask_or(idle_masks.smt, idle_masks.smt, smt);
-		} else {
-			cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
-		}
-	}
-#endif
-}
-
-/*
- * Update the idle state of a CPU to @idle.
- *
- * If @do_notify is true, ops.update_idle() is invoked to notify the scx
- * scheduler of an actual idle state transition (idle to busy or vice
- * versa). If @do_notify is false, only the idle state in the idle masks is
- * refreshed without invoking ops.update_idle().
- *
- * This distinction is necessary, because an idle CPU can be "reserved" and
- * awakened via scx_bpf_pick_idle_cpu() + scx_bpf_kick_cpu(), marking it as
- * busy even if no tasks are dispatched. In this case, the CPU may return
- * to idle without a true state transition. Refreshing the idle masks
- * without invoking ops.update_idle() ensures accurate idle state tracking
- * while avoiding unnecessary updates and maintaining balanced state
- * transitions.
- */
-void __scx_update_idle(struct rq *rq, bool idle, bool do_notify)
-{
-	int cpu = cpu_of(rq);
-
-	lockdep_assert_rq_held(rq);
-
-	/*
-	 * Trigger ops.update_idle() only when transitioning from a task to
-	 * the idle thread and vice versa.
-	 *
-	 * Idle transitions are indicated by do_notify being set to true,
-	 * managed by put_prev_task_idle()/set_next_task_idle().
-	 */
-	if (SCX_HAS_OP(update_idle) && do_notify && !scx_rq_bypassing(rq))
-		SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
-
-	/*
-	 * Update the idle masks:
-	 * - for real idle transitions (do_notify == true)
-	 * - for idle-to-idle transitions (indicated by the previous task
-	 *   being the idle thread, managed by pick_task_idle())
-	 *
-	 * Skip updating idle masks if the previous task is not the idle
-	 * thread, since set_next_task_idle() has already handled it when
-	 * transitioning from a task to the idle thread (calling this
-	 * function with do_notify == true).
-	 *
-	 * In this way we can avoid updating the idle masks twice,
-	 * unnecessarily.
-	 */
-	if (static_branch_likely(&scx_builtin_idle_enabled))
-		if (do_notify || is_idle_task(rq->curr))
-			update_builtin_idle(cpu, idle);
-}
-
 static void handle_hotplug(struct rq *rq, bool online)
 {
 	int cpu = cpu_of(rq);
@@ -3742,7 +3230,7 @@ static void handle_hotplug(struct rq *rq, bool online)
 	atomic_long_inc(&scx_hotplug_seq);
 
 	if (scx_enabled())
-		update_selcpu_topology();
+		scx_idle_update_selcpu_topology();
 
 	if (online && SCX_HAS_OP(cpu_online))
 		SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_online, cpu);
@@ -3774,12 +3262,6 @@ static void rq_offline_scx(struct rq *rq)
 	rq->scx.flags &= ~SCX_RQ_ONLINE;
 }
 
-#else /* CONFIG_SMP */
-
-static bool test_and_clear_cpu_idle(int cpu) { return false; }
-static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) { return -EBUSY; }
-static void reset_idle_masks(void) {}
-
 #endif /* CONFIG_SMP */
 
 static bool check_rq_for_timeouts(struct rq *rq)
@@ -5625,9 +5107,8 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
 		static_branch_enable_cpuslocked(&scx_has_op[i]);
 
 	check_hotplug_seq(ops);
-#ifdef CONFIG_SMP
-	update_selcpu_topology();
-#endif
+	scx_idle_update_selcpu_topology();
+
 	cpus_read_unlock();
 
 	ret = validate_ops(ops);
@@ -5675,7 +5156,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
 		static_branch_enable(&scx_ops_cpu_preempt);
 
 	if (!ops->update_idle || (ops->flags & SCX_OPS_KEEP_BUILTIN_IDLE)) {
-		reset_idle_masks();
+		scx_idle_reset_masks();
 		static_branch_enable(&scx_builtin_idle_enabled);
 	} else {
 		static_branch_disable(&scx_builtin_idle_enabled);
@@ -6318,10 +5799,8 @@ void __init init_sched_ext_class(void)
 		   SCX_TG_ONLINE);
 
 	BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params));
-#ifdef CONFIG_SMP
-	BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
-	BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
-#endif
+	scx_idle_init_masks();
+
 	scx_kick_cpus_pnt_seqs =
 		__alloc_percpu(sizeof(scx_kick_cpus_pnt_seqs[0]) * nr_cpu_ids,
 			       __alignof__(scx_kick_cpus_pnt_seqs[0]));
@@ -6354,62 +5833,6 @@ void __init init_sched_ext_class(void)
 /********************************************************************************
  * Helpers that can be called from the BPF scheduler.
  */
-#include
-
-__bpf_kfunc_start_defs();
-
-static bool check_builtin_idle_enabled(void)
-{
-	if (static_branch_likely(&scx_builtin_idle_enabled))
-		return true;
-
-	scx_ops_error("built-in idle tracking is disabled");
-	return false;
-}
-
-/**
- * scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu()
- * @p: task_struct to select a CPU for
- * @prev_cpu: CPU @p was on previously
- * @wake_flags: %SCX_WAKE_* flags
- * @is_idle: out parameter indicating whether the returned CPU is idle
- *
- * Can only be called from ops.select_cpu() if the built-in CPU selection is
- * enabled - ops.update_idle() is missing or %SCX_OPS_KEEP_BUILTIN_IDLE is set.
- * @p, @prev_cpu and @wake_flags match ops.select_cpu().
- *
- * Returns the picked CPU with *@is_idle indicating whether the picked CPU is
- * currently idle and thus a good candidate for direct dispatching.
- */
-__bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
-				       u64 wake_flags, bool *is_idle)
-{
-	if (!check_builtin_idle_enabled())
-		goto prev_cpu;
-
-	if (!scx_kf_allowed(SCX_KF_SELECT_CPU))
-		goto prev_cpu;
-
-#ifdef CONFIG_SMP
-	return scx_select_cpu_dfl(p, prev_cpu, wake_flags, is_idle);
-#endif
-
-prev_cpu:
-	*is_idle = false;
-	return prev_cpu;
-}
-
-__bpf_kfunc_end_defs();
-
-BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
-BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
-BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
-
-static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = {
-	.owner = THIS_MODULE,
-	.set = &scx_kfunc_ids_select_cpu,
-};
-
 static bool scx_dsq_insert_preamble(struct task_struct *p, u64 enq_flags)
 {
 	if (!scx_kf_allowed(SCX_KF_ENQUEUE | SCX_KF_DISPATCH))
@@ -7468,142 +6891,6 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask)
 	 */
 }
 
-/**
- * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
- * per-CPU cpumask.
- *
- * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
- */
-__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
-{
-	if (!check_builtin_idle_enabled())
-		return cpu_none_mask;
-
-#ifdef CONFIG_SMP
-	return idle_masks.cpu;
-#else
-	return cpu_none_mask;
-#endif
-}
-
-/**
- * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
- * per-physical-core cpumask. Can be used to determine if an entire physical
- * core is free.
- *
- * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
- */
-__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
-{
-	if (!check_builtin_idle_enabled())
-		return cpu_none_mask;
-
-#ifdef CONFIG_SMP
-	if (sched_smt_active())
-		return idle_masks.smt;
-	else
-		return idle_masks.cpu;
-#else
-	return cpu_none_mask;
-#endif
-}
-
-/**
- * scx_bpf_put_idle_cpumask - Release a previously acquired referenced kptr to
- * either the percpu, or SMT idle-tracking cpumask.
- * @idle_mask: &cpumask to use
- */
-__bpf_kfunc void scx_bpf_put_idle_cpumask(const struct cpumask *idle_mask)
-{
-	/*
-	 * Empty function body because we aren't actually acquiring or releasing
-	 * a reference to a global idle cpumask, which is read-only in the
-	 * caller and is never released. The acquire / release semantics here
-	 * are just used to make the cpumask a trusted pointer in the caller.
-	 */
-}
-
-/**
- * scx_bpf_test_and_clear_cpu_idle - Test and clear @cpu's idle state
- * @cpu: cpu to test and clear idle for
- *
- * Returns %true if @cpu was idle and its idle state was successfully cleared.
- * %false otherwise.
- *
- * Unavailable if ops.update_idle() is implemented and
- * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
- */
-__bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
-{
-	if (!check_builtin_idle_enabled())
-		return false;
-
-	if (ops_cpu_valid(cpu, NULL))
-		return test_and_clear_cpu_idle(cpu);
-	else
-		return false;
-}
-
-/**
- * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu
- * @cpus_allowed: Allowed cpumask
- * @flags: %SCX_PICK_IDLE_CPU_* flags
- *
- * Pick and claim an idle cpu in @cpus_allowed. Returns the picked idle cpu
- * number on success. -%EBUSY if no matching cpu was found.
- *
- * Idle CPU tracking may race against CPU scheduling state transitions. For
- * example, this function may return -%EBUSY as CPUs are transitioning into the
- * idle state. If the caller then assumes that there will be dispatch events on
- * the CPUs as they were all busy, the scheduler may end up stalling with CPUs
- * idling while there are pending tasks. Use scx_bpf_pick_any_cpu() and
- * scx_bpf_kick_cpu() to guarantee that there will be at least one dispatch
- * event in the near future.
- *
- * Unavailable if ops.update_idle() is implemented and
- * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
- */
-__bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed,
-				      u64 flags)
-{
-	if (!check_builtin_idle_enabled())
-		return -EBUSY;
-
-	return scx_pick_idle_cpu(cpus_allowed, flags);
-}
-
-/**
- * scx_bpf_pick_any_cpu - Pick and claim an idle cpu if available or pick any CPU
- * @cpus_allowed: Allowed cpumask
- * @flags: %SCX_PICK_IDLE_CPU_* flags
- *
- * Pick and claim an idle cpu in @cpus_allowed. If none is available, pick any
- * CPU in @cpus_allowed. Guaranteed to succeed and returns the picked idle cpu
- * number if @cpus_allowed is not empty. -%EBUSY is returned if @cpus_allowed is
- * empty.
- *
- * If ops.update_idle() is implemented and %SCX_OPS_KEEP_BUILTIN_IDLE is not
- * set, this function can't tell which CPUs are idle and will always pick any
- * CPU.
- */
-__bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed,
-				     u64 flags)
-{
-	s32 cpu;
-
-	if (static_branch_likely(&scx_builtin_idle_enabled)) {
-		cpu = scx_pick_idle_cpu(cpus_allowed, flags);
-		if (cpu >= 0)
-			return cpu;
-	}
-
-	cpu = cpumask_any_distribute(cpus_allowed);
-	if (cpu < nr_cpu_ids)
-		return cpu;
-	else
-		return -EBUSY;
-}
-
 /**
  * scx_bpf_task_running - Is task currently running?
  * @p: task of interest
@@ -7779,8 +7066,6 @@ static int __init scx_init(void)
 	 * check using scx_kf_allowed().
 	 */
 	if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
-					     &scx_kfunc_set_select_cpu)) ||
-	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
 					     &scx_kfunc_set_enqueue_dispatch)) ||
 	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
 					     &scx_kfunc_set_dispatch)) ||
@@ -7800,6 +7085,12 @@ static int __init scx_init(void)
 		return ret;
 	}
 
+	ret = scx_idle_init();
+	if (ret) {
+		pr_err("sched_ext: Failed to initialize idle tracking (%d)\n", ret);
+		return ret;
+	}
+
 	ret = register_bpf_struct_ops(&bpf_sched_ext_ops, sched_ext_ops);
 	if (ret) {
 		pr_err("sched_ext: Failed to register struct_ops (%d)\n", ret);
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
new file mode 100644
index 000000000000..cb981956005b
--- /dev/null
+++ b/kernel/sched/ext_idle.c
@@ -0,0 +1,752 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
+ * Built-in idle CPU tracking policy.
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo
+ * Copyright (c) 2022 David Vernet
+ * Copyright (c) 2024 Andrea Righi
+ */
+#include "ext_idle.h"
+
+/* Enable/disable built-in idle CPU selection policy */
+DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
+
+#ifdef CONFIG_SMP
+#ifdef CONFIG_CPUMASK_OFFSTACK
+#define CL_ALIGNED_IF_ONSTACK
+#else
+#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
+#endif
+
+/* Enable/disable LLC aware optimizations */
+DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_llc);
+
+/* Enable/disable NUMA aware optimizations */
+DEFINE_STATIC_KEY_FALSE(scx_selcpu_topo_numa);
+
+static struct {
+	cpumask_var_t cpu;
+	cpumask_var_t smt;
+} idle_masks CL_ALIGNED_IF_ONSTACK;
+
+bool scx_idle_test_and_clear_cpu(int cpu)
+{
+#ifdef CONFIG_SCHED_SMT
+	/*
+	 * SMT mask should be cleared whether we can claim @cpu or not. The SMT
+	 * cluster is not wholly idle either way. This also prevents
+	 * scx_pick_idle_cpu() from getting caught in an infinite loop.
+	 */
+	if (sched_smt_active()) {
+		const struct cpumask *smt = cpu_smt_mask(cpu);
+
+		/*
+		 * If offline, @cpu is not its own sibling and
+		 * scx_pick_idle_cpu() can get caught in an infinite loop as
+		 * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
+		 * is eventually cleared.
+		 *
+		 * NOTE: Use cpumask_intersects() and cpumask_test_cpu() to
+		 * reduce memory writes, which may help alleviate cache
+		 * coherence pressure.
+		 */
+		if (cpumask_intersects(smt, idle_masks.smt))
+			cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
+		else if (cpumask_test_cpu(cpu, idle_masks.smt))
+			__cpumask_clear_cpu(cpu, idle_masks.smt);
+	}
+#endif
+	return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
+}
+
+s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
+{
+	int cpu;
+
+retry:
+	if (sched_smt_active()) {
+		cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
+		if (cpu < nr_cpu_ids)
+			goto found;
+
+		if (flags & SCX_PICK_IDLE_CORE)
+			return -EBUSY;
+	}
+
+	cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
+	if (cpu >= nr_cpu_ids)
+		return -EBUSY;
+
+found:
+	if (scx_idle_test_and_clear_cpu(cpu))
+		return cpu;
+	else
+		goto retry;
+}
+
+/*
+ * Return the amount of CPUs in the same LLC domain of @cpu (or zero if the LLC
+ * domain is not defined).
+ */
+static unsigned int llc_weight(s32 cpu)
+{
+	struct sched_domain *sd;
+
+	sd = rcu_dereference(per_cpu(sd_llc, cpu));
+	if (!sd)
+		return 0;
+
+	return sd->span_weight;
+}
+
+/*
+ * Return the cpumask representing the LLC domain of @cpu (or NULL if the LLC
+ * domain is not defined).
+ */
+static struct cpumask *llc_span(s32 cpu)
+{
+	struct sched_domain *sd;
+
+	sd = rcu_dereference(per_cpu(sd_llc, cpu));
+	if (!sd)
+		return NULL;
+
+	return sched_domain_span(sd);
+}
+
+/*
+ * Return the amount of CPUs in the same NUMA domain of @cpu (or zero if the
+ * NUMA domain is not defined).
+ */
+static unsigned int numa_weight(s32 cpu)
+{
+	struct sched_domain *sd;
+	struct sched_group *sg;
+
+	sd = rcu_dereference(per_cpu(sd_numa, cpu));
+	if (!sd)
+		return 0;
+	sg = sd->groups;
+	if (!sg)
+		return 0;
+
+	return sg->group_weight;
+}
+
+/*
+ * Return the cpumask representing the NUMA domain of @cpu (or NULL if the NUMA
+ * domain is not defined).
+ */
+static struct cpumask *numa_span(s32 cpu)
+{
+	struct sched_domain *sd;
+	struct sched_group *sg;
+
+	sd = rcu_dereference(per_cpu(sd_numa, cpu));
+	if (!sd)
+		return NULL;
+	sg = sd->groups;
+	if (!sg)
+		return NULL;
+
+	return sched_group_span(sg);
+}
+
+/*
+ * Return true if the LLC domains do not perfectly overlap with the NUMA
+ * domains, false otherwise.
+ */
+static bool llc_numa_mismatch(void)
+{
+	int cpu;
+
+	/*
+	 * We need to scan all online CPUs to verify whether their scheduling
+	 * domains overlap.
+	 *
+	 * While it is rare to encounter architectures with asymmetric NUMA
+	 * topologies, CPU hotplugging or virtualized environments can result
+	 * in asymmetric configurations.
+	 *
+	 * For example:
+	 *
+	 *  NUMA 0:
+	 *    - LLC 0: cpu0..cpu7
+	 *    - LLC 1: cpu8..cpu15 [offline]
+	 *
+	 *  NUMA 1:
+	 *    - LLC 0: cpu16..cpu23
+	 *    - LLC 1: cpu24..cpu31
+	 *
+	 * In this case, if we only check the first online CPU (cpu0), we might
+	 * incorrectly assume that the LLC and NUMA domains are fully
+	 * overlapping, which is incorrect (as NUMA 1 has two distinct LLC
+	 * domains).
+	 */
+	for_each_online_cpu(cpu)
+		if (llc_weight(cpu) != numa_weight(cpu))
+			return true;
+
+	return false;
+}
+
+/*
+ * Initialize topology-aware scheduling.
+ *
+ * Detect if the system has multiple LLC or multiple NUMA domains and enable
+ * cache-aware / NUMA-aware scheduling optimizations in the default CPU idle
+ * selection policy.
+ *
+ * Assumption: the kernel's internal topology representation assumes that each
+ * CPU belongs to a single LLC domain, and that each LLC domain is entirely
+ * contained within a single NUMA node.
+ */
+void scx_idle_update_selcpu_topology(void)
+{
+	bool enable_llc = false, enable_numa = false;
+	unsigned int nr_cpus;
+	s32 cpu = cpumask_first(cpu_online_mask);
+
+	/*
+	 * Enable LLC domain optimization only when there are multiple LLC
+	 * domains among the online CPUs. If all online CPUs are part of a
+	 * single LLC domain, the idle CPU selection logic can choose any
+	 * online CPU without bias.
+	 *
+	 * Note that it is sufficient to check the LLC domain of the first
+	 * online CPU to determine whether a single LLC domain includes all
+	 * CPUs.
+	 */
+	rcu_read_lock();
+	nr_cpus = llc_weight(cpu);
+	if (nr_cpus > 0) {
+		if (nr_cpus < num_online_cpus())
+			enable_llc = true;
+		pr_debug("sched_ext: LLC=%*pb weight=%u\n",
+			 cpumask_pr_args(llc_span(cpu)), llc_weight(cpu));
+	}
+
+	/*
+	 * Enable NUMA optimization only when there are multiple NUMA domains
+	 * among the online CPUs and the NUMA domains don't perfectly overlap
+	 * with the LLC domains.
+	 *
+	 * If all CPUs belong to the same NUMA node and the same LLC domain,
+	 * enabling both NUMA and LLC optimizations is unnecessary, as checking
+	 * for an idle CPU in the same domain twice is redundant.
+	 */
+	nr_cpus = numa_weight(cpu);
+	if (nr_cpus > 0) {
+		if (nr_cpus < num_online_cpus() && llc_numa_mismatch())
+			enable_numa = true;
+		pr_debug("sched_ext: NUMA=%*pb weight=%u\n",
+			 cpumask_pr_args(numa_span(cpu)), numa_weight(cpu));
+	}
+	rcu_read_unlock();
+
+	pr_debug("sched_ext: LLC idle selection %s\n",
+		 str_enabled_disabled(enable_llc));
+	pr_debug("sched_ext: NUMA idle selection %s\n",
+		 str_enabled_disabled(enable_numa));
+
+	if (enable_llc)
+		static_branch_enable_cpuslocked(&scx_selcpu_topo_llc);
+	else
+		static_branch_disable_cpuslocked(&scx_selcpu_topo_llc);
+	if (enable_numa)
+		static_branch_enable_cpuslocked(&scx_selcpu_topo_numa);
+	else
+		static_branch_disable_cpuslocked(&scx_selcpu_topo_numa);
+}
+
+/*
+ * Built-in CPU idle selection policy:
+ *
+ * 1. Prioritize full-idle cores:
+ *   - always prioritize CPUs from fully idle cores (both logical CPUs are
+ *     idle) to avoid interference caused by SMT.
+ *
+ * 2. Reuse the same CPU:
+ *   - prefer the last used CPU to take advantage of cached data (L1, L2) and
+ *     branch prediction optimizations.
+ *
+ * 3. Pick a CPU within the same LLC (Last-Level Cache):
+ *   - if the above conditions aren't met, pick a CPU that shares the same LLC
+ *     to maintain cache locality.
+ *
+ * 4. Pick a CPU within the same NUMA node, if enabled:
+ *   - choose a CPU from the same NUMA node to reduce memory access latency.
+ *
+ * 5. Pick any idle CPU usable by the task.
+ *
+ * Step 3 and 4 are performed only if the system has, respectively, multiple
+ * LLC domains / multiple NUMA nodes (see scx_selcpu_topo_llc and
+ * scx_selcpu_topo_numa).
+ *
+ * NOTE: tasks that can only run on 1 CPU are excluded by this logic, because
+ * we never call ops.select_cpu() for them, see select_task_rq().
+ */
+s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool *found)
+{
+	const struct cpumask *llc_cpus = NULL;
+	const struct cpumask *numa_cpus = NULL;
+	s32 cpu;
+
+	*found = false;
+
+	/*
+	 * This is necessary to protect llc_cpus.
+	 */
+	rcu_read_lock();
+
+	/*
+	 * Determine the scheduling domain only if the task is allowed to run
+	 * on all CPUs.
+	 *
+	 * This is done primarily for efficiency, as it avoids the overhead of
+	 * updating a cpumask every time we need to select an idle CPU (which
+	 * can be costly in large SMP systems), but it also aligns logically:
+	 * if a task's scheduling domain is restricted by user-space (through
+	 * CPU affinity), the task will simply use the flat scheduling domain
+	 * defined by user-space.
+	 */
+	if (p->nr_cpus_allowed >= num_possible_cpus()) {
+		if (static_branch_maybe(CONFIG_NUMA, &scx_selcpu_topo_numa))
+			numa_cpus = numa_span(prev_cpu);
+
+		if (static_branch_maybe(CONFIG_SCHED_MC, &scx_selcpu_topo_llc))
+			llc_cpus = llc_span(prev_cpu);
+	}
+
+	/*
+	 * If WAKE_SYNC, try to migrate the wakee to the waker's CPU.
+	 */
+	if (wake_flags & SCX_WAKE_SYNC) {
+		cpu = smp_processor_id();
+
+		/*
+		 * If the waker's CPU is cache affine and prev_cpu is idle,
+		 * then avoid a migration.
+		 */
+		if (cpus_share_cache(cpu, prev_cpu) &&
+		    scx_idle_test_and_clear_cpu(prev_cpu)) {
+			cpu = prev_cpu;
+			goto cpu_found;
+		}
+
+		/*
+		 * If the waker's local DSQ is empty, and the system is under
+		 * utilized, try to wake up @p to the local DSQ of the waker.
+		 *
+		 * Checking only for an empty local DSQ is insufficient as it
+		 * could give the wakee an unfair advantage when the system is
+		 * oversaturated.
+		 *
+		 * Checking only for the presence of idle CPUs is also
+		 * insufficient as the local DSQ of the waker could have tasks
+		 * piled up on it even if there is an idle core elsewhere on
+		 * the system.
+		 */
+		if (!cpumask_empty(idle_masks.cpu) &&
+		    !(current->flags & PF_EXITING) &&
+		    cpu_rq(cpu)->scx.local_dsq.nr == 0) {
+			if (cpumask_test_cpu(cpu, p->cpus_ptr))
+				goto cpu_found;
+		}
+	}
+
+	/*
+	 * If CPU has SMT, any wholly idle CPU is likely a better pick than
+	 * partially idle @prev_cpu.
+	 */
+	if (sched_smt_active()) {
+		/*
+		 * Keep using @prev_cpu if it's part of a fully idle core.
+		 */
+		if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
+		    scx_idle_test_and_clear_cpu(prev_cpu)) {
+			cpu = prev_cpu;
+			goto cpu_found;
+		}
+
+		/*
+		 * Search for any fully idle core in the same LLC domain.
+		 */
+		if (llc_cpus) {
+			cpu = scx_pick_idle_cpu(llc_cpus, SCX_PICK_IDLE_CORE);
+			if (cpu >= 0)
+				goto cpu_found;
+		}
+
+		/*
+		 * Search for any fully idle core in the same NUMA node.
+		 */
+		if (numa_cpus) {
+			cpu = scx_pick_idle_cpu(numa_cpus, SCX_PICK_IDLE_CORE);
+			if (cpu >= 0)
+				goto cpu_found;
+		}
+
+		/*
+		 * Search for any full idle core usable by the task.
+		 */
+		cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
+		if (cpu >= 0)
+			goto cpu_found;
+	}
+
+	/*
+	 * Use @prev_cpu if it's idle.
+	 */
+	if (scx_idle_test_and_clear_cpu(prev_cpu)) {
+		cpu = prev_cpu;
+		goto cpu_found;
+	}
+
+	/*
+	 * Search for any idle CPU in the same LLC domain.
+	 */
+	if (llc_cpus) {
+		cpu = scx_pick_idle_cpu(llc_cpus, 0);
+		if (cpu >= 0)
+			goto cpu_found;
+	}
+
+	/*
+	 * Search for any idle CPU in the same NUMA node.
+	 */
+	if (numa_cpus) {
+		cpu = scx_pick_idle_cpu(numa_cpus, 0);
+		if (cpu >= 0)
+			goto cpu_found;
+	}
+
+	/*
+	 * Search for any idle CPU usable by the task.
+	 */
+	cpu = scx_pick_idle_cpu(p->cpus_ptr, 0);
+	if (cpu >= 0)
+		goto cpu_found;
+
+	rcu_read_unlock();
+	return prev_cpu;
+
+cpu_found:
+	rcu_read_unlock();
+
+	*found = true;
+	return cpu;
+}
+
+void scx_idle_reset_masks(void)
+{
+	/*
+	 * Consider all online cpus idle. Should converge to the actual state
+	 * quickly.
+	 */
+	cpumask_copy(idle_masks.cpu, cpu_online_mask);
+	cpumask_copy(idle_masks.smt, cpu_online_mask);
+}
+
+void scx_idle_init_masks(void)
+{
+	BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
+	BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
+}
+
+static void update_builtin_idle(int cpu, bool idle)
+{
+	assign_cpu(cpu, idle_masks.cpu, idle);
+
+#ifdef CONFIG_SCHED_SMT
+	if (sched_smt_active()) {
+		const struct cpumask *smt = cpu_smt_mask(cpu);
+
+		if (idle) {
+			/*
+			 * idle_masks.smt handling is racy but that's fine as
+			 * it's only for optimization and self-correcting.
+			 */
+			if (!cpumask_subset(smt, idle_masks.cpu))
+				return;
+			cpumask_or(idle_masks.smt, idle_masks.smt, smt);
+		} else {
+			cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
+		}
+	}
+#endif
+}
+
+/*
+ * Update the idle state of a CPU to @idle.
+ *
+ * If @do_notify is true, ops.update_idle() is invoked to notify the scx
+ * scheduler of an actual idle state transition (idle to busy or vice
+ * versa). If @do_notify is false, only the idle state in the idle masks is
+ * refreshed without invoking ops.update_idle().
+ *
+ * This distinction is necessary, because an idle CPU can be "reserved" and
+ * awakened via scx_bpf_pick_idle_cpu() + scx_bpf_kick_cpu(), marking it as
+ * busy even if no tasks are dispatched. In this case, the CPU may return
+ * to idle without a true state transition. Refreshing the idle masks
+ * without invoking ops.update_idle() ensures accurate idle state tracking
+ * while avoiding unnecessary updates and maintaining balanced state
+ * transitions.
+ */
+void __scx_update_idle(struct rq *rq, bool idle, bool do_notify)
+{
+	int cpu = cpu_of(rq);
+
+	lockdep_assert_rq_held(rq);
+
+	/*
+	 * Trigger ops.update_idle() only when transitioning from a task to
+	 * the idle thread and vice versa.
+	 *
+	 * Idle transitions are indicated by do_notify being set to true,
+	 * managed by put_prev_task_idle()/set_next_task_idle().
+	 */
+	if (SCX_HAS_OP(update_idle) && do_notify && !scx_rq_bypassing(rq))
+		SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
+
+	/*
+	 * Update the idle masks:
+	 * - for real idle transitions (do_notify == true)
+	 * - for idle-to-idle transitions (indicated by the previous task
+	 *   being the idle thread, managed by pick_task_idle())
+	 *
+	 * Skip updating idle masks if the previous task is not the idle
+	 * thread, since set_next_task_idle() has already handled it when
+	 * transitioning from a task to the idle thread (calling this
+	 * function with do_notify == true).
+	 *
+	 * In this way we can avoid updating the idle masks twice,
+	 * unnecessarily.
+	 */
+	if (static_branch_likely(&scx_builtin_idle_enabled))
+		if (do_notify || is_idle_task(rq->curr))
+			update_builtin_idle(cpu, idle);
+}
+#endif	/* CONFIG_SMP */
+
+/********************************************************************************
+ * Helpers that can be called from the BPF scheduler.
+ */
+__bpf_kfunc_start_defs();
+
+static bool check_builtin_idle_enabled(void)
+{
+	if (static_branch_likely(&scx_builtin_idle_enabled))
+		return true;
+
+	scx_ops_error("built-in idle tracking is disabled");
+	return false;
+}
+
+/**
+ * scx_bpf_select_cpu_dfl - The default implementation of ops.select_cpu()
+ * @p: task_struct to select a CPU for
+ * @prev_cpu: CPU @p was on previously
+ * @wake_flags: %SCX_WAKE_* flags
+ * @is_idle: out parameter indicating whether the returned CPU is idle
+ *
+ * Can only be called from ops.select_cpu() if the built-in CPU selection is
+ * enabled - ops.update_idle() is missing or %SCX_OPS_KEEP_BUILTIN_IDLE is set.
+ * @p, @prev_cpu and @wake_flags match ops.select_cpu().
+ *
+ * Returns the picked CPU with *@is_idle indicating whether the picked CPU is
+ * currently idle and thus a good candidate for direct dispatching.
+ */
+__bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
+				       u64 wake_flags, bool *is_idle)
+{
+	if (!check_builtin_idle_enabled())
+		goto prev_cpu;
+
+	if (!scx_kf_allowed(SCX_KF_SELECT_CPU))
+		goto prev_cpu;
+
+#ifdef CONFIG_SMP
+	return scx_select_cpu_dfl(p, prev_cpu, wake_flags, is_idle);
+#endif
+
+prev_cpu:
+	*is_idle = false;
+	return prev_cpu;
+}
+
+/**
+ * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
+ * per-CPU cpumask.
+ *
+ * Returns an empty mask if idle tracking is not enabled, or on a UP kernel.
+ */
+__bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
+{
+	if (!check_builtin_idle_enabled())
+		return cpu_none_mask;
+
+#ifdef CONFIG_SMP
+	return idle_masks.cpu;
+#else
+	return cpu_none_mask;
+#endif
+}
+
+/**
+ * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
+ * per-physical-core cpumask. Can be used to determine if an entire physical
+ * core is free.
+ *
+ * Returns an empty mask if idle tracking is not enabled, or on a UP kernel.
+ */
+__bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
+{
+	if (!check_builtin_idle_enabled())
+		return cpu_none_mask;
+
+#ifdef CONFIG_SMP
+	if (sched_smt_active())
+		return idle_masks.smt;
+	else
+		return idle_masks.cpu;
+#else
+	return cpu_none_mask;
+#endif
+}
+
+/**
+ * scx_bpf_put_idle_cpumask - Release a previously acquired referenced kptr to
+ * either the percpu, or SMT idle-tracking cpumask.
+ * @idle_mask: &cpumask to use
+ */
+__bpf_kfunc void scx_bpf_put_idle_cpumask(const struct cpumask *idle_mask)
+{
+	/*
+	 * Empty function body because we aren't actually acquiring or releasing
+	 * a reference to a global idle cpumask, which is read-only in the
+	 * caller and is never released. The acquire / release semantics here
+	 * are just used to make the cpumask a trusted pointer in the caller.
+	 */
+}
+
+/**
+ * scx_bpf_test_and_clear_cpu_idle - Test and clear @cpu's idle state
+ * @cpu: cpu to test and clear idle for
+ *
+ * Returns %true if @cpu was idle and its idle state was successfully cleared.
+ * %false otherwise.
+ *
+ * Unavailable if ops.update_idle() is implemented and
+ * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
+ */
+__bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
+{
+	if (!check_builtin_idle_enabled())
+		return false;
+
+	if (ops_cpu_valid(cpu, NULL))
+		return scx_idle_test_and_clear_cpu(cpu);
+	else
+		return false;
+}
+
+/**
+ * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu
+ * @cpus_allowed: Allowed cpumask
+ * @flags: %SCX_PICK_IDLE_CPU_* flags
+ *
+ * Pick and claim an idle cpu in @cpus_allowed. Returns the picked idle cpu
+ * number on success. -%EBUSY if no matching cpu was found.
+ *
+ * Idle CPU tracking may race against CPU scheduling state transitions. For
+ * example, this function may return -%EBUSY as CPUs are transitioning into the
+ * idle state. If the caller then assumes that there will be dispatch events on
+ * the CPUs as they were all busy, the scheduler may end up stalling with CPUs
+ * idling while there are pending tasks. Use scx_bpf_pick_any_cpu() and
+ * scx_bpf_kick_cpu() to guarantee that there will be at least one dispatch
+ * event in the near future.
+ *
+ * Unavailable if ops.update_idle() is implemented and
+ * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
+ */
+__bpf_kfunc s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed,
+				      u64 flags)
+{
+	if (!check_builtin_idle_enabled())
+		return -EBUSY;
+
+	return scx_pick_idle_cpu(cpus_allowed, flags);
+}
+
+/**
+ * scx_bpf_pick_any_cpu - Pick and claim an idle cpu if available or pick any CPU
+ * @cpus_allowed: Allowed cpumask
+ * @flags: %SCX_PICK_IDLE_CPU_* flags
+ *
+ * Pick and claim an idle cpu in @cpus_allowed. If none is available, pick any
+ * CPU in @cpus_allowed. Guaranteed to succeed and returns the picked idle cpu
+ * number if @cpus_allowed is not empty. -%EBUSY is returned if @cpus_allowed is
+ * empty.
+ *
+ * If ops.update_idle() is implemented and %SCX_OPS_KEEP_BUILTIN_IDLE is not
+ * set, this function can't tell which CPUs are idle and will always pick any
+ * CPU.
+ */
+__bpf_kfunc s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed,
+				     u64 flags)
+{
+	s32 cpu;
+
+	if (static_branch_likely(&scx_builtin_idle_enabled)) {
+		cpu = scx_pick_idle_cpu(cpus_allowed, flags);
+		if (cpu >= 0)
+			return cpu;
+	}
+
+	cpu = cpumask_any_distribute(cpus_allowed);
+	if (cpu < nr_cpu_ids)
+		return cpu;
+	else
+		return -EBUSY;
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(scx_kfunc_ids_idle)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
+BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
+BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
+BTF_KFUNCS_END(scx_kfunc_ids_idle)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_idle = {
+	.owner			= THIS_MODULE,
+	.set			= &scx_kfunc_ids_idle,
+};
+
+BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
+BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
+BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = {
+	.owner			= THIS_MODULE,
+	.set			= &scx_kfunc_ids_select_cpu,
+};
+
+int scx_idle_init(void)
+{
+	int ret;
+
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_select_cpu) ||
+	      register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_idle) ||
+	      register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &scx_kfunc_set_idle) ||
+	      register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_idle);
+
+	return ret;
+}
diff --git a/kernel/sched/ext_idle.h b/kernel/sched/ext_idle.h
new file mode 100644
index 000000000000..7a13a74815ba
--- /dev/null
+++ b/kernel/sched/ext_idle.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
+ * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2022 Tejun Heo
+ * Copyright (c) 2022 David Vernet
+ * Copyright (c) 2024 Andrea Righi
+ */
+#ifndef _KERNEL_SCHED_EXT_IDLE_H
+#define _KERNEL_SCHED_EXT_IDLE_H
+
+extern struct static_key_false scx_builtin_idle_enabled;
+
+#ifdef CONFIG_SMP
+extern struct static_key_false scx_selcpu_topo_llc;
+extern struct static_key_false scx_selcpu_topo_numa;
+
+void scx_idle_update_selcpu_topology(void);
+void scx_idle_reset_masks(void);
+void scx_idle_init_masks(void);
+bool scx_idle_test_and_clear_cpu(int cpu);
+s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags);
+#else /* !CONFIG_SMP */
+static inline void scx_idle_update_selcpu_topology(void) {}
+static inline void scx_idle_reset_masks(void) {}
+static inline void scx_idle_init_masks(void) {}
+static inline bool scx_idle_test_and_clear_cpu(int cpu) { return false; }
+static inline s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
+{
+	return -EBUSY;
+}
+#endif /* CONFIG_SMP */
+
+s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool *found);
+
+extern int scx_idle_init(void);
+
+#endif /* _KERNEL_SCHED_EXT_IDLE_H */
-- 
2.48.1