From: chenjinghuang
To: Steven Rostedt
CC: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched/rt: rto_next_cpu: Skip CPUs with NEED_RESCHED
Date: Tue, 25 Nov 2025 07:26:36 +0000
Message-ID: <4b60e303c2ac4fa0b6dc51e629427492@huawei.com>
References: <20251121014004.564508-1-chenjinghuang2@huawei.com> <20251121123811.3d34b10b@gandalf.local.home>
In-Reply-To: <20251121123811.3d34b10b@gandalf.local.home>
-----Original Message-----
From: Steven Rostedt
Sent: November 22, 2025 1:38
To: chenjinghuang
Cc: mingo@redhat.com; peterz@infradead.org; juri.lelli@redhat.com; vincent.guittot@linaro.org; dietmar.eggemann@arm.com; bsegall@google.com; mgorman@suse.de; vschneid@redhat.com; linux-kernel@vger.kernel.org
Subject: Re: [PATCH] sched/rt: rto_next_cpu: Skip CPUs with NEED_RESCHED

On Fri, 21 Nov 2025 01:40:04 +0000
Chen Jinghuang wrote:

> CPU0 becomes overloaded when hosting a CPU-bound RT task, a
> non-CPU-bound RT task, and a CFS task stuck in kernel space. When
> other CPUs switch from RT to non-RT tasks, RT load balancing (LB) is
> triggered; with HAVE_RT_PUSH_IPI enabled, they send IPIs to CPU0 to
> drive the execution of rto_push_irq_work_func. During push_rt_task on
> CPU0, if next_task->prio < rq->donor->prio, resched_curr() sets
> NEED_RESCHED, and after the push operation completes, CPU0 calls
> rto_next_cpu(). Since only CPU0 is overloaded in this scenario,
> rto_next_cpu() should ideally return -1 (no further IPI needed).
>
> However, multiple CPUs invoking tell_cpu_to_push() during LB increment
> rd->rto_loop_next. Even when rd->rto_cpu is set to -1, the mismatch
> between rd->rto_loop and rd->rto_loop_next forces rto_next_cpu() to
> restart its search from -1. With CPU0 remaining overloaded (satisfying
> rt_nr_migratory && rt_nr_total > 1), it gets reselected, causing CPU0
> to queue irq_work to itself and send self-IPIs repeatedly. As long as
> CPU0 stays overloaded and other CPUs run pull_rt_tasks(), it falls
> into an infinite self-IPI loop, wasting CPU cycles on unnecessary
> interrupt handling.

Is it truly "infinite", or just wasted due to other CPUs requesting a
pull? Also, it appears the issue here is that it's sending to itself.

The IPI explosion in this scenario is caused by two combined factors:
cross-CPU IPIs triggered by other CPUs repeatedly initiating
pull_rt_tasks(), and self-IPIs sent by CPU0 after reselecting itself in
rto_next_cpu(). These two factors form a chain reaction, resulting in a
de facto infinite stream of redundant IPIs for as long as CPU0 remains
overloaded.
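For reference, the restart behavior comes from the tail of the scan
loop in rto_next_cpu(). Roughly, abridged from the current
kernel/sched/rt.c (the inline annotations here are mine):

	for (;;) {
		/* When rto_cpu is -1 this acts like cpumask_first() */
		cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
		rd->rto_cpu = cpu;

		/* CPU0 is still set in rto_mask, so it is returned here */
		if (cpu < nr_cpu_ids)
			return cpu;

		rd->rto_cpu = -1;

		/* Pairs with the atomic_inc() in tell_cpu_to_push() */
		next = atomic_read_acquire(&rd->rto_loop_next);

		/* No new pusher arrived: stop, no further IPI */
		if (rd->rto_loop == next)
			break;

		/* Another CPU bumped rto_loop_next: rescan from -1 */
		rd->rto_loop = next;
	}

	return -1;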
> The triggering scenario is as follows:
>
> cpu0                        cpu1                cpu2
>                             pull_rt_task
>                             tell_cpu_to_push
> <------------irq_work_queue_on
> rto_push_irq_work_func
>   push_rt_task
>     resched_curr(rq)                            pull_rt_task
>   rto_next_cpu                                  tell_cpu_to_push
> <-------------------------- atomic_inc(rto_loop_next)
> rd->rto_loop != next
> rto_next_cpu
> irq_work_queue_on
> rto_push_irq_work_func
>
> Fix redundant self-IPI/cross-CPU IPI when the target CPU already has a
> pending reschedule, making the IPI unnecessary.
>
> Signed-off-by: Chen Jinghuang
> ---
>  kernel/sched/rt.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 7936d4333731..29ce1af9f121 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2123,8 +2123,20 @@ static int rto_next_cpu(struct root_domain *rd)
>
>  		rd->rto_cpu = cpu;
>
> -		if (cpu < nr_cpu_ids)
> +		if (cpu < nr_cpu_ids) {
> +			struct task_struct *t;
> +			struct rq *rq = cpu_rq(cpu);
> +
> +			rcu_read_lock();
> +			t = rcu_dereference(rq->curr);
> +			if (test_tsk_need_resched(t)) {
> +				rcu_read_unlock();
> +				continue;
> +			}
> +			rcu_read_unlock();
> +
>  			return cpu;
> +		}
>
>  		rd->rto_cpu = -1;
>

Instead of skipping need resched, would skipping the current CPU work too?

Acknowledged: sending an IPI to itself is the direct trigger for the
loop. The original approach of checking NEED_RESCHED was an indirect
optimization that did not address the core issue.

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7936d4333731..cacd8912cd31 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2100,6 +2100,7 @@ static void push_rt_tasks(struct rq *rq)
  */
 static int rto_next_cpu(struct root_domain *rd)
 {
+	int this_cpu = smp_processor_id();
 	int next;
 	int cpu;
 
@@ -2118,10 +2119,13 @@ static int rto_next_cpu(struct root_domain *rd)
 	 */
 	for (;;) {
 
-		/* When rto_cpu is -1 this acts like cpumask_first() */
-		cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
+		do {
+			/* When rto_cpu is -1 this acts like cpumask_first() */
+			cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
+			rd->rto_cpu = cpu;
 
-		rd->rto_cpu = cpu;
+			/* Do not send IPI to self */
+		} while (cpu == this_cpu);
 
 		if (cpu < nr_cpu_ids)
 			return cpu;

-- Steve