From: Chuck Lever
To: tj@kernel.org, jiangshanlai@gmail.com
Cc: Chuck Lever
Subject: [PATCH v2] workqueue: Automatic affinity scope fallback for single-pod topologies
Date: Wed, 4 Feb 2026 21:49:11 -0500
Message-ID: <20260205024912.6753-1-cel@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

From: Chuck Lever

The default affinity scope WQ_AFFN_CACHE assumes systems have multiple
last-level caches. On systems where all CPUs share a single LLC (common
with Intel monolithic dies), this scope degenerates to a single worker
pool. All queue_work() calls then contend on that pool's single lock,
causing severe performance degradation under high-throughput workloads.

For example, on a 12-core system with a single shared L3 cache running
NFS over RDMA with 12 fio jobs, perf shows approximately 39% of CPU
cycles spent in native_queued_spin_lock_slowpath, nearly all from
__queue_work() contending on the single pool lock. On such systems the
WQ_AFFN_CACHE and WQ_AFFN_NUMA scopes both collapse to a single pod.

Add wq_effective_affn_scope() to detect when a selected affinity scope
provides only one pod despite having multiple CPUs, and automatically
fall back to a finer-grained scope. This enables lock distribution to
scale with CPU count without requiring manual configuration via the
workqueue.default_affinity_scope parameter or per-workqueue sysfs
tuning.

The fallback is conservative: it triggers only when a scope degenerates
to exactly one pod, and it respects explicitly configured (non-default)
scopes.

Also update wq_affn_scope_show() to display the effective scope when
fallback occurs, making the behavior transparent to administrators via
sysfs (e.g., "default (cache -> smt)").
Signed-off-by: Chuck Lever
---
Changes since RFC:
- Add a new affinity scope between CPU and CACHE

 include/linux/workqueue.h |  8 ++++-
 kernel/workqueue.c        | 68 +++++++++++++++++++++++++++++++++++----
 2 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index dabc351cc127..1fca5791337d 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -128,10 +128,16 @@ struct rcu_work {
 	struct workqueue_struct *wq;
 };
 
+/*
+ * Affinity scopes are ordered from finest to coarsest granularity. This
+ * ordering is used by the automatic fallback logic in wq_effective_affn_scope()
+ * which walks from coarse toward fine when a scope degenerates to a single pod.
+ */
 enum wq_affn_scope {
 	WQ_AFFN_DFL,			/* use system default */
 	WQ_AFFN_CPU,			/* one pod per CPU */
-	WQ_AFFN_SMT,			/* one pod poer SMT */
+	WQ_AFFN_SMT,			/* one pod per SMT */
+	WQ_AFFN_CLUSTER,		/* one pod per cluster */
 	WQ_AFFN_CACHE,			/* one pod per LLC */
 	WQ_AFFN_NUMA,			/* one pod per NUMA node */
 	WQ_AFFN_SYSTEM,			/* one pod across the whole system */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 253311af47c6..32598b9cd1c2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -405,6 +405,7 @@ static const char *wq_affn_names[WQ_AFFN_NR_TYPES] = {
 	[WQ_AFFN_DFL]			= "default",
 	[WQ_AFFN_CPU]			= "cpu",
 	[WQ_AFFN_SMT]			= "smt",
+	[WQ_AFFN_CLUSTER]		= "cluster",
 	[WQ_AFFN_CACHE]			= "cache",
 	[WQ_AFFN_NUMA]			= "numa",
 	[WQ_AFFN_SYSTEM]		= "system",
@@ -4753,6 +4754,39 @@ static void wqattrs_actualize_cpumask(struct workqueue_attrs *attrs,
 	cpumask_copy(attrs->cpumask, unbound_cpumask);
 }
 
+/*
+ * Determine the effective affinity scope. If the configured scope results
+ * in a single pod (e.g., WQ_AFFN_CACHE on a system with one shared LLC),
+ * fall back to a finer-grained scope to distribute pool lock contention.
+ *
+ * The search stops at WQ_AFFN_CPU, which always provides one pod per CPU
+ * and thus cannot degenerate further.
+ *
+ * Returns the scope to actually use, which may differ from the configured
+ * scope on systems where coarser scopes degenerate.
+ */
+static enum wq_affn_scope wq_effective_affn_scope(enum wq_affn_scope scope)
+{
+	struct wq_pod_type *pt;
+
+	/*
+	 * Walk from the requested scope toward finer granularity. Stop
+	 * when a scope provides more than one pod, or when CPU scope is
+	 * reached. CPU scope always provides nr_possible_cpus() pods.
+	 */
+	while (scope > WQ_AFFN_CPU) {
+		pt = &wq_pod_types[scope];
+
+		/* Multiple pods at this scope; no fallback needed */
+		if (pt->nr_pods > 1)
+			break;
+
+		scope--;
+	}
+
+	return scope;
+}
+
 /* find wq_pod_type to use for @attrs */
 static const struct wq_pod_type *
 wqattrs_pod_type(const struct workqueue_attrs *attrs)
@@ -4763,8 +4797,13 @@ wqattrs_pod_type(const struct workqueue_attrs *attrs)
 	/* to synchronize access to wq_affn_dfl */
 	lockdep_assert_held(&wq_pool_mutex);
 
+	/*
+	 * For default scope, apply automatic fallback for degenerate
+	 * topologies. Explicit scope selection via sysfs or per-workqueue
+	 * attributes bypasses fallback, preserving administrator intent.
+	 */
 	if (attrs->affn_scope == WQ_AFFN_DFL)
-		scope = wq_affn_dfl;
+		scope = wq_effective_affn_scope(wq_affn_dfl);
 	else
 		scope = attrs->affn_scope;
 
@@ -7206,16 +7245,27 @@ static ssize_t wq_affn_scope_show(struct device *dev,
 				  struct device_attribute *attr, char *buf)
 {
 	struct workqueue_struct *wq = dev_to_wq(dev);
+	enum wq_affn_scope scope, effective;
 	int written;
 
 	mutex_lock(&wq->mutex);
-	if (wq->unbound_attrs->affn_scope == WQ_AFFN_DFL)
-		written = scnprintf(buf, PAGE_SIZE, "%s (%s)\n",
-				    wq_affn_names[WQ_AFFN_DFL],
-				    wq_affn_names[wq_affn_dfl]);
-	else
+	if (wq->unbound_attrs->affn_scope == WQ_AFFN_DFL) {
+		scope = wq_affn_dfl;
+		effective = wq_effective_affn_scope(scope);
+		if (wq_pod_types[effective].nr_pods >
+		    wq_pod_types[scope].nr_pods)
+			written = scnprintf(buf, PAGE_SIZE, "%s (%s -> %s)\n",
+					    wq_affn_names[WQ_AFFN_DFL],
+					    wq_affn_names[scope],
+					    wq_affn_names[effective]);
+		else
+			written = scnprintf(buf, PAGE_SIZE, "%s (%s)\n",
+					    wq_affn_names[WQ_AFFN_DFL],
+					    wq_affn_names[scope]);
+	} else {
 		written = scnprintf(buf, PAGE_SIZE, "%s\n",
 				    wq_affn_names[wq->unbound_attrs->affn_scope]);
+	}
 	mutex_unlock(&wq->mutex);
 
 	return written;
@@ -8023,6 +8073,11 @@ static bool __init cpus_share_smt(int cpu0, int cpu1)
 #endif
 }
 
+static bool __init cpus_share_cluster(int cpu0, int cpu1)
+{
+	return cpumask_test_cpu(cpu0, topology_cluster_cpumask(cpu1));
+}
+
 static bool __init cpus_share_numa(int cpu0, int cpu1)
 {
 	return cpu_to_node(cpu0) == cpu_to_node(cpu1);
@@ -8042,6 +8097,7 @@ void __init workqueue_init_topology(void)
 
 	init_pod_type(&wq_pod_types[WQ_AFFN_CPU], cpus_dont_share);
 	init_pod_type(&wq_pod_types[WQ_AFFN_SMT], cpus_share_smt);
+	init_pod_type(&wq_pod_types[WQ_AFFN_CLUSTER], cpus_share_cluster);
 	init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache);
 	init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);
 
-- 
2.52.0