From: Naman Jain
To: Thomas Gleixner, Andrew Morton
Cc: Tianyou Li, Wangyang Guo, Tim Chen, linux-kernel@vger.kernel.org,
	Long Li, Naman Jain
Subject: [PATCH] lib/group_cpus: rotate extra groups to avoid IRQ stacking
Date: Tue, 24 Mar 2026 07:53:52 +0000
Message-ID: <20260324075352.2326972-1-namjain@linux.microsoft.com>

When multiple devices call group_cpus_evenly() with the same number of
groups, the cluster-aware path in __try_group_cluster_cpus() assigns
extra groups to the same set of clusters every time, producing identical
affinity masks for every caller. CPUs in clusters that receive two
groups (and thus get single-CPU dedicated masks) end up handling
interrupts from ALL devices, creating an IRQ imbalance.

For example, on a 96-CPU / 2-NUMA-node system with 24 clusters of 2 CPUs
each and 6 NVMe disks each requesting 62 vectors:
alloc_groups_to_nodes() distributes 31 groups across 24 clusters, giving
7 clusters 2 groups (single-CPU mask = dedicated) and 17 clusters 1
group (2-CPU mask = shared). Because the assignment is deterministic,
all 6 disks produce the same mapping and the same 14 CPUs each
accumulate 6 dedicated IRQs -- roughly twice the interrupt load of
other CPUs -- causing up to 11% per-disk throughput degradation on
IRQ-heavy CPUs.

Fix this by introducing a per-caller rotation offset via a static
atomic counter.
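As a hedged illustration of the per-group rotation this patch introduces, the selection rule can be modeled in standalone userspace C; this is a sketch, not the kernel code, and `cpus_in_group` is a hypothetical helper name:

```c
#include <assert.h>

/*
 * Model of the rotated extra-CPU selection: when ncpus does not divide
 * evenly by ngroups, the groups satisfying
 * (v + offset) % ngroups < extra each receive one extra CPU. Different
 * per-caller offsets move the extras to different groups while the
 * total CPU count stays the same.
 */
unsigned int cpus_in_group(unsigned int ncpus, unsigned int ngroups,
			   unsigned int v, unsigned int offset)
{
	unsigned int base = ncpus / ngroups;
	unsigned int extra = ncpus % ngroups;

	return base + (((v + offset) % ngroups < extra) ? 1 : 0);
}
```

With ncpus = 7 and ngroups = 3, offset 0 gives the extra CPU to group 0, while offset 1 moves it to group 2; either way the three groups still sum to 7 CPUs.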
After alloc_groups_to_nodes() determines each cluster's group count,
collect the extras (groups above the per-cluster minimum), then
redistribute them starting from a rotated position with a stride of
ncluster/extras so that successive callers scatter their extra groups
across different clusters. A capacity check (cpumask_weight_and)
ensures no cluster is assigned more groups than it has CPUs, with a
fallback loop for any extras that could not be placed in the strided
pass.

For systems without cluster topology, the same rotation is applied in
assign_cpus_to_groups() at the per-group level: the modular expression
(v + spread_offset) % nv->ngroups selects which groups receive the
extra CPU, replacing the previous sequential decrement.

Tested on a 96-vCPU Hyper-V VM (AMD EPYC 9V74, 48 clusters of 2, 2 NUMA
nodes) with 6 NVMe data disks (63 MSI-X vectors each), running 4K
random-read fio (psync, 16 jobs/disk, 120s, CPU-pinned):

- Worst-disk degradation vs average: 11% -> 5%
- Previously penalized disks: +12% IOPS, 10% lower latency

Fixes: 89802ca36c96 ("lib/group_cpus: make group CPU cluster aware")
Co-developed-by: Long Li
Signed-off-by: Long Li
Signed-off-by: Naman Jain
---
 lib/group_cpus.c | 116 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 104 insertions(+), 12 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index e6e18d7a49bb..241c2c9b437d 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -7,6 +7,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #ifdef CONFIG_SMP
@@ -255,12 +256,20 @@ static void alloc_nodes_groups(unsigned int numgrps,
 	alloc_groups_to_nodes(numgrps, numcpus, node_groups, nr_node_ids);
 }
 
+/*
+ * Per-caller rotation counter for group_cpus_evenly().
+ * Wrapping is harmless: the offset is only used modulo small values
+ * (ncluster or nv->ngroups), so any unsigned value works.
+ */
+static atomic_t group_spread_cnt = ATOMIC_INIT(0);
+
 static void assign_cpus_to_groups(unsigned int ncpus,
 				  struct cpumask *nmsk,
 				  struct node_groups *nv,
 				  struct cpumask *masks,
 				  unsigned int *curgrp,
-				  unsigned int last_grp)
+				  unsigned int last_grp,
+				  unsigned int spread_offset)
 {
 	unsigned int v, cpus_per_grp, extra_grps;
 
 	/* Account for rounding errors */
@@ -270,11 +279,15 @@ static void assign_cpus_to_groups(unsigned int ncpus,
 	for (v = 0; v < nv->ngroups; v++, *curgrp += 1) {
 		cpus_per_grp = ncpus / nv->ngroups;
 
-		/* Account for extra groups to compensate rounding errors */
-		if (extra_grps) {
+		/*
+		 * Rotate which groups get the extra CPU so that
+		 * successive callers produce different mappings,
+		 * avoiding IRQ stacking when multiple devices
+		 * share the same CPU topology.
+		 */
+		if (extra_grps &&
+		    (v + spread_offset) % nv->ngroups < extra_grps)
 			cpus_per_grp++;
-			--extra_grps;
-		}
 
 		/*
 		 * wrapping has to be considered given 'startgrp'
@@ -361,7 +374,8 @@ static bool __try_group_cluster_cpus(unsigned int ncpus,
 				     struct cpumask *node_cpumask,
 				     struct cpumask *masks,
 				     unsigned int *curgrp,
-				     unsigned int last_grp)
+				     unsigned int last_grp,
+				     unsigned int spread_offset)
 {
 	struct node_groups *cluster_groups;
 	const struct cpumask **clusters;
@@ -379,6 +393,78 @@ static bool __try_group_cluster_cpus(unsigned int ncpus,
 	if (ncluster == 0)
 		goto fail_no_clusters;
 
+	/*
+	 * Rotate which clusters receive extra groups so that different
+	 * callers of group_cpus_evenly() produce different group-to-CPU
+	 * mappings. Without this, all devices get identical affinity
+	 * masks, causing IRQ stacking on CPUs assigned single-CPU groups.
+	 *
+	 * alloc_groups_to_nodes() distributes ngroups proportionally, but
+	 * rounding causes some clusters to get one more group than others.
+	 * The assignment is deterministic, so every device gets the same
+	 * mapping. Fix: collect the "extra" groups (above the per-cluster
+	 * minimum), then redistribute them starting from a rotated
+	 * position. The ncpus constraint (ngroups <= ncpus) is preserved
+	 * by checking each cluster's actual CPU count before adding.
+	 *
+	 * Note: after alloc_groups_to_nodes(), the union field holds
+	 * ngroups (not ncpus), so we recompute ncpus from cluster masks.
+	 */
+	if (ncluster > 1) {
+		unsigned int min_grps = UINT_MAX;
+		unsigned int total_extra = 0;
+		unsigned int start, stride;
+
+		for (i = 0; i < ncluster; i++) {
+			if (cluster_groups[i].ngroups < min_grps)
+				min_grps = cluster_groups[i].ngroups;
+		}
+
+		for (i = 0; i < ncluster; i++) {
+			if (cluster_groups[i].ngroups > min_grps) {
+				total_extra +=
+					cluster_groups[i].ngroups - min_grps;
+				cluster_groups[i].ngroups = min_grps;
+			}
+		}
+
+		/*
+		 * Redistribute extras using a stride to scatter them
+		 * across clusters. With stride = ncluster / extras,
+		 * consecutive callers' extra sets overlap minimally
+		 * (e.g. max 2 overlap for 6 callers with 24 clusters
+		 * and 7 extras, vs 6 overlap with stride 1).
+		 */
+		start = spread_offset % ncluster;
+		stride = (total_extra > 0 && total_extra < ncluster) ?
+			 ncluster / total_extra : 1;
+
+		for (i = 0; i < ncluster && total_extra > 0; i++) {
+			unsigned int idx =
+				(start + i * stride) % ncluster;
+			unsigned int cap;
+
+			cap = cpumask_weight_and(clusters[cluster_groups[idx].id],
+						 node_cpumask);
+			if (cluster_groups[idx].ngroups < cap) {
+				cluster_groups[idx].ngroups++;
+				total_extra--;
+			}
+		}
+
+		/* Fallback: place remaining extras wherever they fit */
+		for (i = 0; i < ncluster && total_extra > 0; i++) {
+			unsigned int cap;
+
+			cap = cpumask_weight_and(clusters[cluster_groups[i].id],
+						 node_cpumask);
+			if (cluster_groups[i].ngroups < cap) {
+				cluster_groups[i].ngroups++;
+				total_extra--;
+			}
+		}
+	}
+
 	for (i = 0; i < ncluster; i++) {
 		struct node_groups *nv = &cluster_groups[i];
 
@@ -389,7 +475,8 @@ static bool __try_group_cluster_cpus(unsigned int ncpus,
 			continue;
 		WARN_ON_ONCE(nv->ngroups > nc);
 
-		assign_cpus_to_groups(nc, nmsk, nv, masks, curgrp, last_grp);
+		assign_cpus_to_groups(nc, nmsk, nv, masks, curgrp, last_grp,
+				      spread_offset);
 	}
 
 	ret = true;
@@ -404,7 +491,8 @@ static bool __try_group_cluster_cpus(unsigned int ncpus,
 static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
 			       cpumask_var_t *node_to_cpumask,
 			       const struct cpumask *cpu_mask,
-			       struct cpumask *nmsk, struct cpumask *masks)
+			       struct cpumask *nmsk, struct cpumask *masks,
+			       unsigned int spread_offset)
 {
 	unsigned int i, n, nodes, done = 0;
 	unsigned int last_grp = numgrps;
@@ -455,13 +543,14 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
 		WARN_ON_ONCE(nv->ngroups > ncpus);
 
 		if (__try_group_cluster_cpus(ncpus, nv->ngroups, nmsk,
-					     masks, &curgrp, last_grp)) {
+					     masks, &curgrp, last_grp,
+					     spread_offset)) {
 			done += nv->ngroups;
 			continue;
 		}
 
 		assign_cpus_to_groups(ncpus, nmsk, nv, masks, &curgrp,
-				      last_grp);
+				      last_grp, spread_offset);
 		done += nv->ngroups;
 	}
 	kfree(node_groups);
@@ -488,6 +577,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
 struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
 {
 	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
+	unsigned int spread_offset;
 	cpumask_var_t *node_to_cpumask;
 	cpumask_var_t nmsk, npresmsk;
 	int ret = -ENOMEM;
@@ -510,6 +600,8 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
 	if (!masks)
 		goto fail_node_to_cpumask;
 
+	spread_offset = (unsigned int)atomic_fetch_inc(&group_spread_cnt);
+
 	build_node_to_cpumask(node_to_cpumask);
 
 	/*
@@ -528,7 +620,7 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
 
 	/* grouping present CPUs first */
 	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
-				  npresmsk, nmsk, masks);
+				  npresmsk, nmsk, masks, spread_offset);
 	if (ret < 0)
 		goto fail_node_to_cpumask;
 	nr_present = ret;
@@ -545,7 +637,7 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps, unsigned int *nummasks)
 	curgrp = nr_present;
 	cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
 	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
-				  npresmsk, nmsk, masks);
+				  npresmsk, nmsk, masks, spread_offset);
 	if (ret >= 0)
 		nr_others = ret;
 

base-commit: c369299895a591d96745d6492d4888259b004a9e
-- 
2.43.0
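For reference, the strided redistribution of extra groups in the hunk above can be modeled as a standalone userspace sketch. This is an assumed simplification: `place_extras` is a hypothetical name, and the plain `grps`/`cap` arrays stand in for the cluster_groups bookkeeping and the cpumask_weight_and() capacity checks:

```c
#include <assert.h>

/*
 * Sketch of the strided extra-group placement: starting from a rotated
 * position, walk the clusters with stride ncluster / extras, honoring
 * each cluster's CPU capacity; a fallback pass places anything the
 * strided walk could not fit.
 */
void place_extras(unsigned int *grps, unsigned int ncluster,
		  const unsigned int *cap, unsigned int extras,
		  unsigned int offset)
{
	unsigned int start = offset % ncluster;
	unsigned int stride = (extras > 0 && extras < ncluster) ?
			      ncluster / extras : 1;
	unsigned int i;

	for (i = 0; i < ncluster && extras > 0; i++) {
		unsigned int idx = (start + i * stride) % ncluster;

		if (grps[idx] < cap[idx]) {
			grps[idx]++;
			extras--;
		}
	}
	/* Fallback: place remaining extras wherever they fit */
	for (i = 0; i < ncluster && extras > 0; i++) {
		if (grps[i] < cap[i]) {
			grps[i]++;
			extras--;
		}
	}
}
```

On the commit-message example (24 clusters of 2 CPUs, 7 extra groups, stride 3), offset 0 places extras on clusters 0, 3, 6, ..., 18 while offset 1 uses clusters 1, 4, ..., 19, so two successive callers stack no extra groups on the same clusters and the total group count is preserved.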