From nobody Sun Feb 8 12:19:12 2026
From: Shrikanth Hegde
To: mingo@kernel.org, peterz@infradead.org, vincent.guittot@linaro.org,
 linux-kernel@vger.kernel.org
Cc: sshegde@linux.ibm.com, kprateek.nayak@amd.com, juri.lelli@redhat.com,
 vschneid@redhat.com, tglx@kernel.org, dietmar.eggemann@arm.com,
 anna-maria@linutronix.de, frederic@kernel.org, wangyang.guo@intel.com
Subject: [PATCH v5 3/3] sched/fair: Remove nohz.nr_cpus and use weight of cpumask instead
Date: Thu, 15 Jan 2026 13:05:24 +0530
Message-ID: <20260115073524.376643-4-sshegde@linux.ibm.com>
X-Mailer: git-send-email 2.51.0
In-Reply-To: <20260115073524.376643-1-sshegde@linux.ibm.com>
References: <20260115073524.376643-1-sshegde@linux.ibm.com>

nohz.nr_cpus was observed to be a contended cacheline when running an
enterprise workload on large systems.

The fundamental scalability challenge with nohz.idle_cpus_mask and
nohz.nr_cpus is the following:

 (1) nohz_balancer_kick() observes (reads) nohz.nr_cpus (or
     nohz.idle_cpus_mask) and nohz.has_blocked to see whether there's
     any nohz balancing work to do, in every scheduler tick.

 (2) nohz_balance_enter_idle() and nohz_balance_exit_idle() (through
     nohz_balancer_kick() via sched_tick()) modify (write) nohz.nr_cpus
     (and/or nohz.idle_cpus_mask) and nohz.has_blocked.

The characteristic frequencies are the following:

 (1) nohz_balancer_kick() happens at scheduler (busy) tick frequency
     on CPUs that have not gone idle. This is a relatively constant
     frequency in the ~1 kHz range or lower.

 (2) happens at idle enter/exit frequency on every CPU that goes idle.
     This is workload dependent, but can easily be hundreds of kHz for
     IO-bound loads and high CPU counts. I.e. it can be orders of
     magnitude higher than (1), in which case a cache miss at every
     invocation of (1) is almost inevitable.

     Idle exit will trigger (1) on the CPU which is coming out of idle.

There are two types of costs from these functions:

 (A) scheduler tick cost via (1): this happens on busy CPUs too, and is
     thus a primary scalability cost. But the rate here is constant and
     typically much lower than (B), hence the absolute benefit to
     workload scalability will be lower as well.

 (B) idle cost via (2): going-to-idle and coming-from-idle costs are
     secondary concerns, because they impact power efficiency more than
     they impact scalability.
     But in terms of absolute cost this scales up with nr_cpus as well,
     and at a much faster rate, and thus may also approach and
     negatively impact system limits like memory bus/fabric bandwidth.

Note that nohz.idle_cpus_mask and nohz.nr_cpus may appear to reside in the
same cacheline, however under CONFIG_CPUMASK_OFFSTACK=y the backing storage
for nohz.idle_cpus_mask will be elsewhere. With CPUMASK_OFFSTACK=n, the
nohz.idle_cpus_mask and the rest of the nohz fields are in different
cachelines under typical NR_CPUS=512/2048. This implies two separate
cachelines being dirtied upon idle entry / exit.

nohz.nr_cpus can be derived from the mask itself. Its usage does not
require an exactly correct value. This means one less cacheline being
dirtied in the idle entry/exit path, which helps to save some bus
bandwidth for those nohz functions (approx. 50%). This in turn helps to
improve enterprise workload throughput.

On a system with 480 CPUs, running "hackbench 40 process 10000 loops"
(avg of 3 runs):

baseline:
     0.81%  hackbench  [k] nohz_balance_exit_idle
     0.21%  hackbench  [k] nohz_balancer_kick
     0.09%  swapper    [k] nohz_run_idle_balance

With patch:
     0.35%  hackbench  [k] nohz_balance_exit_idle
     0.09%  hackbench  [k] nohz_balancer_kick
     0.07%  swapper    [k] nohz_run_idle_balance

[Ingo Molnar: scalability analysis changelog]
Reviewed-by: Valentin Schneider
Reviewed-and-tested-by: K Prateek Nayak
Signed-off-by: Shrikanth Hegde
---
 kernel/sched/fair.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d843d1396ec..46ed16466be4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7141,7 +7141,6 @@ static DEFINE_PER_CPU(cpumask_var_t, should_we_balance_tmpmask);
 
 static struct {
 	cpumask_var_t idle_cpus_mask;
-	atomic_t nr_cpus;
 	int has_blocked_load;		/* Idle CPUS has blocked load */
 	int needs_update;		/* Newly idle CPUs need their next_balance collated */
 	unsigned long next_balance;	/* in jiffy units */
@@ -12465,7 +12464,7 @@ static void nohz_balancer_kick(struct rq *rq)
 	 * None are in tickless mode and hence no need for NOHZ idle load
 	 * balancing
 	 */
-	if (unlikely(!atomic_read(&nohz.nr_cpus)))
+	if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
 		return;
 
 	if (rq->nr_running >= 2) {
@@ -12578,7 +12577,6 @@ void nohz_balance_exit_idle(struct rq *rq)
 
 	rq->nohz_tick_stopped = 0;
 	cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
-	atomic_dec(&nohz.nr_cpus);
 
 	set_cpu_sd_state_busy(rq->cpu);
 }
@@ -12636,7 +12634,6 @@ void nohz_balance_enter_idle(int cpu)
 	rq->nohz_tick_stopped = 1;
 
 	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
-	atomic_inc(&nohz.nr_cpus);
 
 	/*
 	 * Ensures that if nohz_idle_balance() fails to observe our
-- 
2.51.0
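
Illustrative note (not part of the patch): the idea of dropping the separate
counter and asking the mask itself "is any bit set?" can be modelled in
userspace with C11 atomics. This is only a rough sketch under stated
assumptions: every model_* name, the 128-CPU size and the 64-bit word layout
are hypothetical and do not reflect the kernel's cpumask implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* All names below are hypothetical; this is a userspace model, not kernel code. */
#define MODEL_NR_CPUS   128
#define MODEL_WORD_BITS 64
#define MODEL_WORDS     (MODEL_NR_CPUS / MODEL_WORD_BITS)

/* Stand-in for nohz.idle_cpus_mask: one bit per CPU whose tick is stopped. */
static _Atomic unsigned long long model_idle_mask[MODEL_WORDS];

/* Idle entry: set this CPU's bit. No separate counter is written. */
void model_enter_idle(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	atomic_fetch_or(&model_idle_mask[cpu / MODEL_WORD_BITS],
			1ULL << (cpu % MODEL_WORD_BITS));
}

/* Idle exit: clear this CPU's bit. */
void model_exit_idle(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	atomic_fetch_and(&model_idle_mask[cpu / MODEL_WORD_BITS],
			 ~(1ULL << (cpu % MODEL_WORD_BITS)));
}

/*
 * Tick path: a read-only scan of the mask words, standing in for the
 * "counter == 0" test. Under concurrent updates the answer may be
 * momentarily stale, which the changelog above suggests is tolerable.
 */
bool model_no_idle_cpus(void)
{
	for (int i = 0; i < MODEL_WORDS; i++) {
		if (atomic_load_explicit(&model_idle_mask[i],
					 memory_order_relaxed))
			return false;
	}
	return true;
}
```

In this model the idle entry/exit paths write only the mask word that holds
the CPU's bit, and the tick path only reads, so there is no second shared
location for writers to dirty.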