From: Shrikanth Hegde
To: mingo@kernel.org, peterz@infradead.org, vincent.guittot@linaro.org,
    linux-kernel@vger.kernel.org
Cc: sshegde@linux.ibm.com, kprateek.nayak@amd.com, juri.lelli@redhat.com,
    vschneid@redhat.com, tglx@linutronix.de, dietmar.eggemann@arm.com,
    anna-maria@linutronix.de, frederic@kernel.org, wangyang.guo@intel.com
Subject: [PATCH v3 3/3] sched/fair: Remove nohz.nr_cpus and use weight of cpumask instead
Date: Wed, 7 Jan 2026 12:21:25 +0530
Message-ID: <20260107065125.669668-4-sshegde@linux.ibm.com>
In-Reply-To: <20260107065125.669668-1-sshegde@linux.ibm.com>
References: <20260107065125.669668-1-sshegde@linux.ibm.com>

nohz.nr_cpus was observed to be a contended cacheline when running an
enterprise workload on large systems.

The fundamental scalability challenge with nohz.idle_cpus_mask and
nohz.nr_cpus is the following:

 (1) nohz_balancer_kick() observes (reads) nohz.nr_cpus
     (or nohz.idle_cpus_mask) and nohz.has_blocked_load to see whether
     there's any nohz balancing work to do, in every scheduler tick.

 (2) nohz_balance_enter_idle() and nohz_balance_exit_idle()
     (through nohz_balancer_kick() via sched_tick()) modify (write)
     nohz.nr_cpus (and/or nohz.idle_cpus_mask) and nohz.has_blocked_load.

The characteristic frequencies are the following:

 (1) nohz_balancer_kick() happens at scheduler (busy) tick frequency
     on every CPU that has not gone idle. This is a relatively constant
     frequency in the ~1 kHz range or lower.

 (2) happens at idle enter/exit frequency on every CPU that goes idle.
     This is workload dependent, but can easily be hundreds of kHz for
     IO-bound loads and high CPU counts, i.e. it can be orders of
     magnitude higher than (1), in which case a cache miss at every
     invocation of (1) is almost inevitable. Idle exit will also trigger
     (1) on the CPU which is coming out of idle.

There are two types of costs from these functions:

 (A) scheduler tick cost via (1): this happens on busy CPUs too, and is
     thus a primary scalability cost. But the rate here is constant and
     typically much lower than (B), hence the absolute benefit to
     workload scalability will be lower as well.

 (B) idle cost via (2): going-to-idle and coming-from-idle costs are
     secondary concerns, because they impact power efficiency more than
     they impact scalability.
     But in terms of absolute cost this scales up with nr_cpus as well,
     and at a much faster rate, and thus may also approach and
     negatively impact system limits like memory bus/fabric bandwidth.

The fundamental scalability challenge described above remains true for
nohz.idle_cpus_mask even after this patch. But nr_cpus can be derived
from the mask itself, and its users do not need an exact value. The
read can race; at worst, an additional load balance may be attempted.
So, derive the value from idle_cpus_mask.

This saves some bus bandwidth w.r.t. that nohz cacheline (approx. 50%),
which in turn helps to improve enterprise workload throughput.

This theory holds true for CPUMASK_OFFSTACK=y and mostly true for
CPUMASK_OFFSTACK=n (the last few bits, depending on NR_CPUS, could be
in the same cacheline as nr_cpus).

On a system with 480 CPUs, running "hackbench 40 process 10000 loops"
(avg of 3 runs):

baseline:
     0.81%  hackbench  [k] nohz_balance_exit_idle
     0.21%  hackbench  [k] nohz_balancer_kick
     0.09%  swapper    [k] nohz_run_idle_balance

With patch:
     0.35%  hackbench  [k] nohz_balance_exit_idle
     0.09%  hackbench  [k] nohz_balancer_kick
     0.07%  swapper    [k] nohz_run_idle_balance

[Ingo Molnar: scalability analysis changelog]
Signed-off-by: Shrikanth Hegde
---
 kernel/sched/fair.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c03f963f6216..3408a5beb95b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7144,7 +7144,6 @@ static DEFINE_PER_CPU(cpumask_var_t, should_we_balance_tmpmask);
 
 static struct {
 	cpumask_var_t idle_cpus_mask;
-	atomic_t nr_cpus;
 	int has_blocked_load;		/* Idle CPUS has blocked load */
 	int needs_update;		/* Newly idle CPUs need their next_balance collated */
 	unsigned long next_balance;     /* in jiffy units */
@@ -12466,7 +12465,7 @@ static void nohz_balancer_kick(struct rq *rq)
 	 * None are in tickless mode and hence no need for NOHZ idle load
 	 * balancing
 	 */
-	if (unlikely(!atomic_read(&nohz.nr_cpus)))
+	if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
 		return;
 
 	if (rq->nr_running >= 2) {
@@ -12579,7 +12578,6 @@ void nohz_balance_exit_idle(struct rq *rq)
 
 	rq->nohz_tick_stopped = 0;
 	cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
-	atomic_dec(&nohz.nr_cpus);
 
 	set_cpu_sd_state_busy(rq->cpu);
 }
@@ -12637,7 +12635,6 @@ void nohz_balance_enter_idle(int cpu)
 	rq->nohz_tick_stopped = 1;
 
 	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
-	atomic_inc(&nohz.nr_cpus);
 
 	/*
 	 * Ensures that if nohz_idle_balance() fails to observe our
-- 
2.47.3
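
For illustration only, and not part of the patch: a minimal userspace C
sketch of the idea in the changelog, assuming a hypothetical 480-CPU mask
held in C11 atomics. The names sketch_enter_idle(), sketch_exit_idle() and
sketch_any_idle() are made up for this example; they only mimic the roles of
cpumask_set_cpu(), cpumask_clear_cpu() and cpumask_empty() and are not the
kernel implementations. The point is that the tick-side check can scan the
mask instead of reading a separately maintained counter, so each idle
entry/exit performs one atomic write to the shared data instead of two.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sizes for this sketch; 480 matches the test system above. */
#define SKETCH_NR_CPUS	480
#define SKETCH_WORD_BITS	(8 * sizeof(unsigned long))
#define SKETCH_MASK_WORDS \
	((SKETCH_NR_CPUS + SKETCH_WORD_BITS - 1) / SKETCH_WORD_BITS)

/* Userspace stand-in for nohz.idle_cpus_mask; zero-initialized (no idle CPUs). */
static _Atomic unsigned long sketch_idle_mask[SKETCH_MASK_WORDS];

/* Idle entry: one atomic bit set, no separate counter to increment. */
static void sketch_enter_idle(unsigned int cpu)
{
	assert(cpu < SKETCH_NR_CPUS);
	atomic_fetch_or_explicit(&sketch_idle_mask[cpu / SKETCH_WORD_BITS],
				 1UL << (cpu % SKETCH_WORD_BITS),
				 memory_order_relaxed);
}

/* Idle exit: one atomic bit clear, no separate counter to decrement. */
static void sketch_exit_idle(unsigned int cpu)
{
	assert(cpu < SKETCH_NR_CPUS);
	atomic_fetch_and_explicit(&sketch_idle_mask[cpu / SKETCH_WORD_BITS],
				  ~(1UL << (cpu % SKETCH_WORD_BITS)),
				  memory_order_relaxed);
}

/*
 * Tick-side check: derive "is any CPU idle?" from the mask itself.
 * The scan is not a consistent snapshot; a concurrent update may be
 * missed or seen early, which this sketch tolerates on purpose.
 */
static bool sketch_any_idle(void)
{
	for (size_t i = 0; i < SKETCH_MASK_WORDS; i++) {
		if (atomic_load_explicit(&sketch_idle_mask[i],
					 memory_order_relaxed))
			return true;
	}
	return false;
}
```

A caller would invoke sketch_enter_idle(cpu) and sketch_exit_idle(cpu) on
idle transitions and return early from the tick-side path when
sketch_any_idle() is false. A stale result costs at most one unnecessary
balance attempt, which is the trade-off the changelog argues for.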