From: Naman Jain
To: Thomas Gleixner, Andrew Morton
Cc: Tianyou Li, Wangyang Guo, Tim Chen, linux-kernel@vger.kernel.org,
	Long Li, Naman Jain
Subject: [PATCH] lib/group_cpus: rotate extra groups to avoid IRQ stacking
Date: Tue, 24 Mar 2026 07:53:52 +0000
Message-ID: <20260324075352.2326972-1-namjain@linux.microsoft.com>

When multiple devices call group_cpus_evenly() with the same number of
groups, the cluster-aware path in __try_group_cluster_cpus() assigns
extra groups to the same set of clusters every time, producing identical
affinity masks for every caller. CPUs in clusters that receive two
groups (and thus get single-CPU dedicated masks) end up handling
interrupts from ALL devices, creating an IRQ imbalance.

For example, on a 96-CPU / 2-NUMA-node system with 24 clusters of 2 CPUs
each and 6 NVMe disks each requesting 62 vectors:
alloc_groups_to_nodes() distributes 31 groups across 24 clusters, giving
7 clusters 2 groups (single-CPU mask = dedicated) and 17 clusters 1
group (2-CPU mask = shared). Because the assignment is deterministic,
all 6 disks produce the same mapping and the same 14 CPUs each
accumulate 6 dedicated IRQs -- roughly twice the interrupt load of
other CPUs -- causing up to 11% per-disk throughput degradation on
IRQ-heavy CPUs.

Fix this by introducing a per-caller rotation offset via a static
atomic counter.
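As a hedged illustration of the per-group rotation this patch introduces, the selection rule can be modeled in standalone userspace C; this is a sketch, not the kernel code, and `cpus_in_group` is a hypothetical helper name:

```c
#include <assert.h>

/*
 * Model of the rotated extra-CPU selection: when ncpus does not divide
 * evenly by ngroups, the groups satisfying
 * (v + offset) % ngroups < extra each receive one extra CPU. Different
 * per-caller offsets move the extras to different groups while the
 * total CPU count stays the same.
 */
unsigned int cpus_in_group(unsigned int ncpus, unsigned int ngroups,
			   unsigned int v, unsigned int offset)
{
	unsigned int base = ncpus / ngroups;
	unsigned int extra = ncpus % ngroups;

	return base + (((v + offset) % ngroups < extra) ? 1 : 0);
}
```

With ncpus = 7 and ngroups = 3, offset 0 gives the extra CPU to group 0, while offset 1 moves it to group 2; either way the three groups still sum to 7 CPUs.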
After alloc_groups_to_nodes() determines each cluster's group count,
collect the extras (groups above the per-cluster minimum), then
redistribute them starting from a rotated position with a stride of
ncluster/extras so that successive callers scatter their extra groups
across different clusters. A capacity check (cpumask_weight_and)
ensures no cluster is assigned more groups than it has CPUs, with a
fallback loop for any extras that could not be placed in the strided
pass.

For systems without cluster topology, the same rotation is applied in
assign_cpus_to_groups() at the per-group level: the modular expression
(v + spread_offset) % nv->ngroups selects which groups receive the
extra CPU, replacing the previous sequential decrement.

Tested on a 96-vCPU Hyper-V VM (AMD EPYC 9V74, 48 clusters of 2, 2 NUMA
nodes) with 6 NVMe data disks (63 MSI-X vectors each), running 4K
random-read fio (psync, 16 jobs/disk, 120s, CPU-pinned):

- Worst-disk degradation vs average: 11% -> 5%
- Previously penalized disks: +12% IOPS, 10% lower latency

Fixes: 89802ca36c96 ("lib/group_cpus: make group CPU cluster aware")
Co-developed-by: Long Li
Signed-off-by: Long Li
Signed-off-by: Naman Jain
---
 lib/group_cpus.c | 116 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 104 insertions(+), 12 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index e6e18d7a49bb..241c2c9b437d 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -7,6 +7,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #ifdef CONFIG_SMP
@@ -255,12 +256,20 @@ static void alloc_nodes_groups(unsigned int numgrps,
 	alloc_groups_to_nodes(numgrps, numcpus, node_groups, nr_node_ids);
 }
 
+/*
+ * Per-caller rotation counter for group_cpus_evenly().
+ * Wrapping is harmless: the offset is only used modulo small values
+ * (ncluster or nv->ngroups), so any unsigned value works.
+ */
+static atomic_t group_spread_cnt = ATOMIC_INIT(0);
+
 static void assign_cpus_to_groups(unsigned int ncpus,
 				  struct cpumask *nmsk,
 				  struct node_groups *nv,
 				  struct cpumask *masks,
 				  unsigned int *curgrp,
-				  unsigned int last_grp)
+				  unsigned int last_grp,
+				  unsigned int spread_offset)
 {
 	unsigned int v, cpus_per_grp, extra_grps;
 
 	/* Account for rounding errors */
@@ -270,11 +279,15 @@ static void assign_cpus_to_groups(unsigned int ncpus,
 	for (v = 0; v < nv->ngroups; v++, *curgrp += 1) {
 		cpus_per_grp = ncpus / nv->ngroups;
 
-		/* Account for extra groups to compensate rounding errors */
-		if (extra_grps) {
+		/*
+		 * Rotate which groups get the extra CPU so that
+		 * successive callers produce different mappings,
+		 * avoiding IRQ stacking when multiple devices
+		 * share the same CPU topology.
+		 */
+		if (extra_grps &&
+		    (v + spread_offset) % nv->ngroups < extra_grps)
 			cpus_per_grp++;
-			--extra_grps;
-		}
 
 		/*
 		 * wrapping has to be considered given 'startgrp'
@@ -361,7 +374,8 @@ static bool __try_group_cluster_cpus(unsigned int ncpus,
 				     struct cpumask *node_cpumask,
 				     struct cpumask *masks,
 				     unsigned int *curgrp,
-				     unsigned int last_grp)
+				     unsigned int last_grp,
+				     unsigned int spread_offset)
 {
 	struct node_groups *cluster_groups;
 	const struct cpumask **clusters;
@@ -379,6 +393,78 @@ static bool __try_group_cluster_cpus(unsigned int ncpus,
 	if (ncluster == 0)
 		goto fail_no_clusters;
 
+	/*
+	 * Rotate which clusters receive extra groups so that different
+	 * callers of group_cpus_evenly() produce different group-to-CPU
+	 * mappings. Without this, all devices get identical affinity
+	 * masks, causing IRQ stacking on CPUs assigned single-CPU groups.
+	 *
+	 * alloc_groups_to_nodes() distributes ngroups proportionally, but
+	 * rounding causes some clusters to get one more group than others.
+	 * The assignment is deterministic, so every device gets the same
+	 * mapping. Fix: collect the "extra" groups (above the per-cluster
+	 * minimum), then redistribute them starting from a rotated
+	 * position. The ncpus constraint (ngroups <= ncpus) is preserved
+	 * by checking each cluster's actual CPU count before adding.
+	 *
+	 * Note: after alloc_groups_to_nodes(), the union field holds
+	 * ngroups (not ncpus), so we recompute ncpus from cluster masks.
+	 */
+	if (ncluster > 1) {
+		unsigned int min_grps = UINT_MAX;
+		unsigned int total_extra = 0;
+		unsigned int start, stride;
+
+		for (i = 0; i < ncluster; i++) {
+			if (cluster_groups[i].ngroups < min_grps)
+				min_grps = cluster_groups[i].ngroups;
+		}
+
+		for (i = 0; i < ncluster; i++) {
+			if (cluster_groups[i].ngroups > min_grps) {
+				total_extra +=
+					cluster_groups[i].ngroups - min_grps;
+				cluster_groups[i].ngroups = min_grps;
+			}
+		}
+
+		/*
+		 * Redistribute extras using a stride to scatter them
+		 * across clusters. With stride = ncluster / extras,
+		 * consecutive callers' extra sets overlap minimally
+		 * (e.g. max 2 overlap for 6 callers with 24 clusters
+		 * and 7 extras, vs 6 overlap with stride 1).
+		 */
+		start = spread_offset % ncluster;
+		stride = (total_extra > 0 && total_extra < ncluster) ?
+			 ncluster / total_extra : 1;
+
+		for (i = 0; i < ncluster && total_extra > 0; i++) {
+			unsigned int idx =
+				(start + i * stride) % ncluster;
+			unsigned int cap;
+
+			cap = cpumask_weight_and(clusters[cluster_groups[idx].id],
+						 node_cpumask);
+			if (cluster_groups[idx].ngroups < cap) {
+				cluster_groups[idx].ngroups++;
+				total_extra--;
+			}
+		}
+
+		/* Fallback: place remaining extras wherever they fit */
+		for (i = 0; i < ncluster && total_extra > 0; i++) {
+			unsigned int cap;
+
+			cap = cpumask_weight_and(clusters[cluster_groups[i].id],
+						 node_cpumask);
+			if (cluster_groups[i].ngroups < cap) {
+				cluster_groups[i].ngroups++;
+				total_extra--;
+			}
+		}
+	}
+
 	for (i = 0; i < ncluster; i++) {
 		struct node_groups *nv = &cluster_groups[i];
 
@@ -389,7 +475,8 @@ static bool __try_group_cluster_cpus(unsigned int ncpus,
 			continue;
 		WARN_ON_ONCE(nv->ngroups > nc);
 
-		assign_cpus_to_groups(nc, nmsk, nv, masks, curgrp, last_grp);
+		assign_cpus_to_groups(nc, nmsk, nv, masks, curgrp, last_grp,
+				      spread_offset);
 	}
 
 	ret = true;
@@ -404,7 +491,8 @@ static bool __try_group_cluster_cpus(unsigned int ncpus,
 static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
 			       cpumask_var_t *node_to_cpumask,
 			       const struct cpumask *cpu_mask,
-			       struct cpumask *nmsk, struct cpumask *masks)
+			       struct cpumask *nmsk, struct cpumask *masks,
+			       unsigned int spread_offset)
 {
 	unsigned int i, n, nodes, done = 0;
 	unsigned int last_grp = numgrps;
@@ -455,13 +543,14 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
 		WARN_ON_ONCE(nv->ngroups > ncpus);
 
 		if (__try_group_cluster_cpus(ncpus, nv->ngroups, nmsk,
-					     masks, &curgrp, last_grp)) {
+					     masks, &curgrp, last_grp,
+					     spread_offset)) {
 			done += nv->ngroups;
 			continue;
 		}
 
 		assign_cpus_to_groups(ncpus, nmsk, nv, masks, &curgrp,
-				      last_grp);
+				      last_grp, spread_offset);
 		done += nv->ngroups;
 	}
 	kfree(node_groups);
@@ -488,6 +577,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
 struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
 {
 	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
+	unsigned int spread_offset;
 	cpumask_var_t *node_to_cpumask;
 	cpumask_var_t nmsk, npresmsk;
 	int ret = -ENOMEM;
@@ -510,6 +600,8 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
 	if (!masks)
 		goto fail_node_to_cpumask;
 
+	spread_offset = (unsigned int)atomic_fetch_inc(&group_spread_cnt);
+
 	build_node_to_cpumask(node_to_cpumask);
 
 	/*
@@ -528,7 +620,7 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
 
 	/* grouping present CPUs first */
 	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
-				  npresmsk, nmsk, masks);
+				  npresmsk, nmsk, masks, spread_offset);
 	if (ret < 0)
 		goto fail_node_to_cpumask;
 	nr_present = ret;
@@ -545,7 +637,7 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
 	curgrp = nr_present;
 	cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
 	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
-				  npresmsk, nmsk, masks);
+				  npresmsk, nmsk, masks, spread_offset);
 	if (ret >= 0)
 		nr_others = ret;
 

base-commit: c369299895a591d96745d6492d4888259b004a9e
-- 
2.43.0
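For reference, the strided redistribution of extra groups in the hunk above can be modeled as a standalone userspace sketch. This is an assumed simplification: `place_extras` is a hypothetical name, and the plain `grps`/`cap` arrays stand in for the cluster_groups bookkeeping and the cpumask_weight_and() capacity checks:

```c
#include <assert.h>

/*
 * Sketch of the strided extra-group placement: starting from a rotated
 * position, walk the clusters with stride ncluster / extras, honoring
 * each cluster's CPU capacity; a fallback pass places anything the
 * strided walk could not fit.
 */
void place_extras(unsigned int *grps, unsigned int ncluster,
		  const unsigned int *cap, unsigned int extras,
		  unsigned int offset)
{
	unsigned int start = offset % ncluster;
	unsigned int stride = (extras > 0 && extras < ncluster) ?
			      ncluster / extras : 1;
	unsigned int i;

	for (i = 0; i < ncluster && extras > 0; i++) {
		unsigned int idx = (start + i * stride) % ncluster;

		if (grps[idx] < cap[idx]) {
			grps[idx]++;
			extras--;
		}
	}
	/* Fallback: place remaining extras wherever they fit */
	for (i = 0; i < ncluster && extras > 0; i++) {
		if (grps[i] < cap[i]) {
			grps[i]++;
			extras--;
		}
	}
}
```

On the commit-message example (24 clusters of 2 CPUs, 7 extra groups, stride 3), offset 0 places extras on clusters 0, 3, 6, ..., 18 while offset 1 uses clusters 1, 4, ..., 19, so two successive callers stack no extra groups on the same clusters and the total group count is preserved.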