From nobody Tue Jun 16 20:37:09 2026
From: Imran Khan
To: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org
Cc: dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, linux-kernel@vger.kernel.org
Subject: [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs.
Date: Tue, 21 Apr 2026 13:06:21 +0800
Message-Id: <20260421050622.19869-2-imran.f.khan@oracle.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260421050622.19869-1-imran.f.khan@oracle.com>
References: <20260421050622.19869-1-imran.f.khan@oracle.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

On large scale systems, for example with 768 CPUs and cpusets consisting
of 380+ CPUs, there may always be some idle CPU with its
rq->next_balance close to or the same as now. This causes
nohz.next_balance to be perpetually the same as the current jiffies,
which in turn causes the time-based check in nohz_balancer_kick() to
always fail.

For example, with a dtrace probe placed at nohz_balancer_kick() on such a
system, we can see that nohz.next_balance equals the current jiffy at
almost every tick:

  447 9536 nohz_balancer_kick:entry jiffies=9764770863 nohz.next_balance=9764770863
  447 9536 nohz_balancer_kick:entry jiffies=9764770864 nohz.next_balance=9764770864
  447 9536 nohz_balancer_kick:entry jiffies=9764770865 nohz.next_balance=9764770865
  447 9536 nohz_balancer_kick:entry jiffies=9764770866 nohz.next_balance=9764770866
  447 9536 nohz_balancer_kick:entry jiffies=9764770867 nohz.next_balance=9764770867
  447 9536 nohz_balancer_kick:entry jiffies=9764770868 nohz.next_balance=9764770868
  447 9536 nohz_balancer_kick:entry jiffies=9764770869 nohz.next_balance=9764770870
  447 9536 nohz_balancer_kick:entry jiffies=9764770870 nohz.next_balance=9764770870
  447 9536 nohz_balancer_kick:entry jiffies=9764770871 nohz.next_balance=9764770871
  447 9536 nohz_balancer_kick:entry jiffies=9764770872 nohz.next_balance=9764770872
  447 9536 nohz_balancer_kick:entry jiffies=9764770873 nohz.next_balance=9764770873
  447 9536 nohz_balancer_kick:entry jiffies=9764770874 nohz.next_balance=9764770874
  447 9536 nohz_balancer_kick:entry jiffies=9764770875 nohz.next_balance=9764770876
  447 9536 nohz_balancer_kick:entry jiffies=9764770876 nohz.next_balance=9764770876
  447 9536 nohz_balancer_kick:entry jiffies=9764770877 nohz.next_balance=9764770877
  447 9536 nohz_balancer_kick:entry jiffies=9764770878 nohz.next_balance=9764770878

On such a system, setting nohz.next_balance to the next jiffy can cause
kick_ilb() to run almost every tick, and this in turn can consume a lot
of CPU cycles in the subsequent nohz idle balancing.
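
The effect can be illustrated with a small user-space sketch. It is not kernel code: `ilog2_u()`, `extra_jiffies()` and `count_kicks()` are hypothetical names, and the model assumes that the kick path is the only writer of the next-balance timestamp. It therefore ignores the other updates to nohz.next_balance (for example from idle CPUs' rq->next_balance, as described above). `extra_jiffies()` mirrors the adjustment expression used in this patch's diff.

```c
#include <assert.h>

/* User-space stand-in for the kernel's ilog2(): floor(log2(n)) for n > 0. */
static unsigned int ilog2_u(unsigned int n)
{
	unsigned int log = 0;

	assert(n > 0);
	while (n >>= 1)
		log++;
	return log;
}

/* Extra jiffies added on top of "jiffies + 1", following the patch's expression. */
static unsigned int extra_jiffies(unsigned int nr_idle)
{
	return (nr_idle > 32) ? ilog2_u(nr_idle) - 4 : 0;
}

/*
 * Simplified model (assumption): a kick is issued on every tick where
 * "now" has reached next_balance, and only the kick path advances it.
 */
static unsigned int count_kicks(unsigned long ticks, unsigned int extra)
{
	unsigned long now, next_balance = 0;
	unsigned int kicks = 0;

	for (now = 0; now < ticks; now++) {
		/* Models the time_before(now, nohz.next_balance) bail-out. */
		if (now < next_balance)
			continue;
		kicks++;
		next_balance = now + 1 + extra;
	}
	return kicks;
}
```

Under this model, 380 idle CPUs give an adjustment of 4 jiffies, so a kick is issued on one tick in five rather than on every tick.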

So set nohz.next_balance based on the number of currently idle CPUs: with
more than 32 idle CPUs, nohz.next_balance is advanced by an additional
ilog2(nr_idle) - 4 jiffies (1 extra jiffy for 33 to 63 idle CPUs, 2 for
64 to 127, and so on). This allows nohz_balancer_kick() to bail out
early.

Signed-off-by: Imran Khan
---
 kernel/sched/fair.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab4114712be74..bd35275a05b38 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12447,8 +12447,17 @@ static void kick_ilb(unsigned int flags)
 	 * Increase nohz.next_balance only when if full ilb is triggered but
 	 * not if we only update stats.
 	 */
-	if (flags & NOHZ_BALANCE_KICK)
-		nohz.next_balance = jiffies+1;
+	if (flags & NOHZ_BALANCE_KICK) {
+		unsigned int nr_idle = cpumask_weight(nohz.idle_cpus_mask);
+
+		/*
+		 * On large systems, there may always be some idle CPU(s) with
+		 * rq->next_balance close to or at current time, thus causing
+		 * frequent invocation of kick_ilb() from nohz_balancer_kick().
+		 * Adjust next_balance based on the number of idle CPUs.
+		 */
+		nohz.next_balance = jiffies + 1 + ((nr_idle > 32) ? ilog2(nr_idle) - 4 : 0);
+	}
 
 	ilb_cpu = find_new_ilb();
 	if (ilb_cpu < 0)
-- 
2.34.1

From nobody Tue Jun 16 20:37:09 2026
From: Imran Khan
To: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org
Cc: dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, vschneid@redhat.com, linux-kernel@vger.kernel.org
Subject: [PATCH 2/2] sched/fair: distribute nohz ILB work across idle CPUs.
Date: Tue, 21 Apr 2026 13:06:22 +0800
Message-Id: <20260421050622.19869-3-imran.f.khan@oracle.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20260421050622.19869-1-imran.f.khan@oracle.com>
References: <20260421050622.19869-1-imran.f.khan@oracle.com>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

find_new_ilb() uses for_each_cpu_and() to iterate nohz.idle_cpus_mask
from the lowest bit upward, returning the first idle
housekeeping CPU it finds. This can (unfairly) select the lowest-numbered
nohz idle CPU most of the time.

Fix this by selecting the nohz ILB CPU in a round-robin fashion, thereby
distributing the nohz ILB work (which can be significant on large scale
systems) across all eligible idle CPUs.

Signed-off-by: Imran Khan
---
 kernel/sched/fair.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bd35275a05b38..93bdb542ff714 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7213,6 +7213,7 @@ static struct {
 	cpumask_var_t idle_cpus_mask;
 	int has_blocked_load;		/* Idle CPUS has blocked load */
 	int needs_update;		/* Newly idle CPUs need their next_balance collated */
+	int ilb_cpu_last;		/* Last CPU selected for nohz ILB */
 	unsigned long next_balance;	/* in jiffy units */
 	unsigned long next_blocked;	/* Next update of blocked load in jiffies */
 } nohz ____cacheline_aligned;
@@ -12420,13 +12421,17 @@ static inline int find_new_ilb(void)
 
 	hk_mask = housekeeping_cpumask(HK_TYPE_KERNEL_NOISE);
 
-	for_each_cpu_and(ilb_cpu, nohz.idle_cpus_mask, hk_mask) {
+	for_each_cpu_wrap(ilb_cpu, nohz.idle_cpus_mask, nohz.ilb_cpu_last + 1) {
+		if (!cpumask_test_cpu(ilb_cpu, hk_mask))
+			continue;
 
 		if (ilb_cpu == smp_processor_id())
 			continue;
 
-		if (idle_cpu(ilb_cpu))
+		if (idle_cpu(ilb_cpu)) {
+			nohz.ilb_cpu_last = ilb_cpu;
 			return ilb_cpu;
+		}
 	}
 
 	return -1;
-- 
2.34.1
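
The difference between the lowest-first scan and the wrapping scan described in this patch can be sketched in user space as follows. This is an illustration only, not the kernel implementation: `NR_CPUS_MODEL`, `pick_lowest()` and `pick_round_robin()` are hypothetical names, the idle mask is modelled as a plain `bool` array, and the housekeeping-mask and idle_cpu() checks are left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 8	/* hypothetical CPU count for this model */

/* Lowest-first scan: always returns the lowest-numbered idle CPU other than self. */
static int pick_lowest(const bool idle[NR_CPUS_MODEL], int self)
{
	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		if (cpu != self && idle[cpu])
			return cpu;
	return -1;
}

/*
 * Wrapping scan: start just after the previously selected CPU, visit every
 * CPU once (wrapping past the end), and remember the new selection in *last.
 */
static int pick_round_robin(const bool idle[NR_CPUS_MODEL], int self, int *last)
{
	assert(*last >= 0 && *last < NR_CPUS_MODEL);

	for (int i = 1; i <= NR_CPUS_MODEL; i++) {
		int cpu = (*last + i) % NR_CPUS_MODEL;

		if (cpu == self || !idle[cpu])
			continue;
		*last = cpu;
		return cpu;
	}
	return -1;
}
```

With CPUs 1, 3, 5 and 6 idle and the caller on CPU 0, repeated calls to the lowest-first scan keep returning CPU 1, whereas the wrapping scan returns 1, 3, 5, 6 and then 1 again, which spreads the work across the idle CPUs.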