From nobody Fri Dec 19 13:08:34 2025
From: K Prateek Nayak
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Anna-Maria Behnsen, Frederic Weisbecker, Thomas Gleixner
CC: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, K Prateek Nayak, "Gautham R. Shenoy", Swapnil Sapkal, Shrikanth Hegde, Chen Yu
Subject: [RESEND RFC PATCH v2 29/29] [EXPERIMENTAL] sched/fair: Faster alternate for intra-NUMA newidle balance
Date: Mon, 8 Dec 2025 09:27:15 +0000
Message-ID: <20251208092744.32737-29-kprateek.nayak@amd.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20251208083602.31898-1-kprateek.nayak@amd.com>
References: <20251208083602.31898-1-kprateek.nayak@amd.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Kernels that enable CONFIG_PREEMPTION only pull a single task during
newidle balance to keep latency low. In the standard newidle balance
path, computing the busiest group and the busiest rq, only to then pull
a single task from that rq, adds a lot of overhead.

During the discussions at OSPM around the overheads of load balancing,
Peter suggested trying out a different strategy for intra-NUMA newidle
balance with the goal of pulling a task as quickly as possible.

Try out an alternative newidle balance strategy that directly traverses
the CPUs in the sched domain to pull runnable tasks.

Signed-off-by: K Prateek Nayak
---
 kernel/sched/fair.c | 119 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 93 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 46d33ab63336..aa2821a9b800 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11701,6 +11701,11 @@ static int need_active_balance(struct lb_env *env)
 
 static int active_load_balance_cpu_stop(void *data);
 
+static inline bool sched_newidle_stop_balance(struct rq *rq)
+{
+	return (rq->nr_running > 0 || rq->ttwu_pending);
+}
+
 static int should_we_balance(struct lb_env *env)
 {
 	struct cpumask *swb_cpus = this_cpu_cpumask_var_ptr(should_we_balance_tmpmask);
@@ -11722,7 +11727,7 @@ static int should_we_balance(struct lb_env *env)
 	 * to optimize wakeup latency.
 	 */
 	if (env->idle == CPU_NEWLY_IDLE) {
-		if (env->dst_rq->nr_running > 0 || env->dst_rq->ttwu_pending)
+		if (sched_newidle_stop_balance(env->dst_rq))
 			return 0;
 		return 1;
 	}
@@ -13256,6 +13261,7 @@ static inline void fair_queue_pushable_tasks(struct rq *rq) { }
  */
 static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 {
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
 	unsigned long next_balance = jiffies + HZ;
 	int this_cpu = this_rq->cpu;
 	int continue_balancing = 1;
@@ -13315,8 +13321,11 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 	t0 = sched_clock_cpu(this_cpu);
 	sched_balance_update_blocked_averages(this_cpu);
 
+	cpumask_clear(cpus);
+
 	rcu_read_lock();
 	for_each_domain(this_cpu, sd) {
+		unsigned int weight = 1;
 		u64 domain_cost;
 
 		update_next_balance(sd, &next_balance);
@@ -13324,40 +13333,98 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
 			break;
 
-		if (sd->flags & SD_BALANCE_NEWIDLE) {
-			unsigned int weight = 1;
+		if (!(sd->flags & SD_BALANCE_NEWIDLE))
+			continue;
 
-			if (sched_feat(NI_RANDOM)) {
-				/*
-				 * Throw a 1k sided dice; and only run
-				 * newidle_balance according to the success
-				 * rate.
-				 */
-				u32 d1k = sched_rng() % 1024;
-				weight = 1 + sd->newidle_ratio;
-				if (d1k > weight) {
-					update_newidle_stats(sd, 0);
-					continue;
-				}
-				weight = (1024 + weight/2) / weight;
+		if (sched_feat(NI_RANDOM)) {
+			/*
+			 * Throw a 1k sided dice; and only run
+			 * newidle_balance according to the success
+			 * rate.
+			 */
+			u32 d1k = sched_rng() % 1024;
+			weight = 1 + sd->newidle_ratio;
+			if (d1k > weight) {
+				update_newidle_stats(sd, 0);
+				continue;
 			}
+			weight = (1024 + weight/2) / weight;
+		}
 
-			pulled_task = sched_balance_rq(this_cpu, this_rq,
-						   sd, CPU_NEWLY_IDLE,
-						   &continue_balancing);
 
-			t1 = sched_clock_cpu(this_cpu);
-			domain_cost = t1 - t0;
-			curr_cost += domain_cost;
-			t0 = t1;
+		/*
+		 * Non-preemptible kernels can pull more than one task during
+		 * newidle balance and NUMA domains may need special
+		 * consideration to preserve tasks on preferred NUMA node.
+		 *
+		 * Only take fast-path on preemptible kernels for intra NUMA
+		 * domains.
+		 */
+		if (!IS_ENABLED(CONFIG_PREEMPTION) || (sd->flags & SD_NUMA)) {
+			pulled_task = sched_balance_rq(this_cpu, this_rq,
+						   sd, CPU_NEWLY_IDLE,
+						   &continue_balancing);
+		} else {
+			struct lb_env env = {
+				.sd		= sd,
+				.dst_cpu	= this_cpu,
+				.dst_rq		= this_rq,
+				.idle		= CPU_NEWLY_IDLE,
+			};
+			int cpu;
 
 			/*
-			 * Track max cost of a domain to make sure to not delay the
-			 * next wakeup on the CPU.
+			 * Clear the CPUs of child domain. They have already
+			 * been visited during last balance. !NUMA domains do
+			 * not overlap so simply excluding the previous
+			 * domain's span should be enough.
 			 */
-			update_newidle_cost(sd, domain_cost, weight * !!pulled_task);
+			cpumask_andnot(cpus, sched_domain_span(sd), cpus);
+
+			/* Commit to searching the sd if we are idle at start. */
+			continue_balancing = !sched_newidle_stop_balance(this_rq);
+			if (!continue_balancing)
+				break;
+
+			for_each_cpu_wrap(cpu, cpus, this_cpu + 1) {
+				struct task_struct *p = NULL;
+				struct rq *rq = cpu_rq(cpu);
+
+				/* Not overloaded with runnable tasks. */
+				if (rq->cfs.h_nr_runnable <= 1)
+					continue;
+
+				scoped_guard(rq_lock, rq) {
+					/* Check again with rq lock held. */
+					if (rq->cfs.h_nr_runnable <= 1)
+						break;
+
+					env.src_cpu = cpu;
+					env.src_rq = rq;
+
+					update_rq_clock(rq);
+					p = detach_one_task(&env);
+				}
+
+				if (p) {
+					attach_one_task(this_rq, p);
+					pulled_task = 1;
+					break;
+				}
+			}
 		}
 
+		t1 = sched_clock_cpu(this_cpu);
+		domain_cost = t1 - t0;
+		curr_cost += domain_cost;
+		t0 = t1;
+
+		/*
+		 * Track max cost of a domain to make sure to not delay the
+		 * next wakeup on the CPU.
+		 */
+		update_newidle_cost(sd, domain_cost, weight * !!pulled_task);
+
 		/*
 		 * Stop searching for tasks to pull if there are
 		 * now runnable tasks on this rq.
-- 
2.43.0