From: Chuck Lever <cel@kernel.org>
To: tj@kernel.org, jiangshanlai@gmail.com
Cc: linux-kernel@vger.kernel.org, Chuck Lever <cel@kernel.org>
Subject: [RFC PATCH] workqueue: Automatic affinity scope fallback for single-pod topologies
Date: Tue, 3 Feb 2026 09:37:44 -0500
Message-ID: <20260203143744.16578-1-cel@kernel.org>
X-Mailer: git-send-email 2.52.0

From: Chuck Lever <cel@kernel.org>

The default affinity scope, WQ_AFFN_CACHE, assumes systems have multiple
last-level caches. On systems where all CPUs share a single LLC (common
with Intel monolithic dies), this scope degenerates to a single worker
pool. All queue_work() calls then contend on that pool's single lock,
causing severe performance degradation under high-throughput workloads.

For example, on a 12-core system with a single shared L3 cache running
NFS over RDMA with 12 fio jobs, perf shows approximately 39% of CPU
cycles spent in native_queued_spin_lock_slowpath, nearly all of it from
__queue_work() contending on the single pool lock. On such systems the
WQ_AFFN_CACHE, WQ_AFFN_SMT, and WQ_AFFN_NUMA scopes all collapse to a
single pod.

Add wq_effective_affn_scope() to detect when the selected affinity
scope provides only one pod despite the system having multiple CPUs,
and automatically fall back to a finer-grained scope.
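For reference, the manual workaround on such a system is to override
the default scope on the kernel command line, e.g. (illustrative; "cpu"
is chosen to suit this particular topology):

    workqueue.default_affinity_scope=cpu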
The automatic fallback ensures reasonable lock distribution without
requiring manual configuration via the workqueue.default_affinity_scope
parameter or per-workqueue sysfs tuning. The fallback is conservative:
it triggers only when a scope degenerates to exactly one pod, and it
respects explicitly configured (non-default) scopes.

Also update wq_affn_scope_show() to display the effective scope when
fallback occurs, making the behavior transparent to administrators via
sysfs (e.g., "default (cache -> smt)").

Signed-off-by: Chuck Lever <cel@kernel.org>
---
 include/linux/workqueue.h |  7 ++++-
 kernel/workqueue.c        | 60 +++++++++++++++++++++++++++++++++++----
 2 files changed, 60 insertions(+), 7 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index dabc351cc127..130c452fcecf 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -128,10 +128,15 @@ struct rcu_work {
 	struct workqueue_struct *wq;
 };
 
+/*
+ * Affinity scopes are ordered from finest to coarsest granularity. This
+ * ordering is used by the automatic fallback logic in wq_effective_affn_scope(),
+ * which walks from coarse toward fine when a scope degenerates to a single pod.
+ */
 enum wq_affn_scope {
 	WQ_AFFN_DFL,			/* use system default */
 	WQ_AFFN_CPU,			/* one pod per CPU */
-	WQ_AFFN_SMT,			/* one pod poer SMT */
+	WQ_AFFN_SMT,			/* one pod per SMT */
 	WQ_AFFN_CACHE,			/* one pod per LLC */
 	WQ_AFFN_NUMA,			/* one pod per NUMA node */
 	WQ_AFFN_SYSTEM,			/* one pod across the whole system */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 253311af47c6..efbc10ef79fb 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4753,6 +4753,39 @@ static void wqattrs_actualize_cpumask(struct workqueue_attrs *attrs,
 	cpumask_copy(attrs->cpumask, unbound_cpumask);
 }
 
+/*
+ * Determine the effective affinity scope. If the configured scope results
+ * in a single pod (e.g., WQ_AFFN_CACHE on a system with one shared LLC),
+ * fall back to a finer-grained scope to distribute pool lock contention.
+ *
+ * The search stops at WQ_AFFN_CPU, which always provides one pod per CPU
+ * and thus cannot degenerate further.
+ *
+ * Returns the scope to actually use, which may differ from the configured
+ * scope on systems where coarser scopes degenerate.
+ */
+static enum wq_affn_scope wq_effective_affn_scope(enum wq_affn_scope scope)
+{
+	struct wq_pod_type *pt;
+
+	/*
+	 * Walk from the requested scope toward finer granularity. Stop when
+	 * a scope provides more than one pod, or when CPU scope is reached.
+	 * CPU scope always provides nr_possible_cpus() pods.
+	 */
+	while (scope > WQ_AFFN_CPU) {
+		pt = &wq_pod_types[scope];
+
+		/* Multiple pods at this scope; no fallback needed */
+		if (pt->nr_pods > 1)
+			break;
+
+		scope--;
+	}
+
+	return scope;
+}
+
 /* find wq_pod_type to use for @attrs */
 static const struct wq_pod_type *
 wqattrs_pod_type(const struct workqueue_attrs *attrs)
@@ -4763,8 +4796,13 @@ wqattrs_pod_type(const struct workqueue_attrs *attrs)
 	/* to synchronize access to wq_affn_dfl */
 	lockdep_assert_held(&wq_pool_mutex);
 
+	/*
+	 * For the default scope, apply automatic fallback for degenerate
+	 * topologies. Explicit scope selection via sysfs or per-workqueue
+	 * attributes bypasses the fallback, preserving administrator intent.
+	 */
 	if (attrs->affn_scope == WQ_AFFN_DFL)
-		scope = wq_affn_dfl;
+		scope = wq_effective_affn_scope(wq_affn_dfl);
 	else
 		scope = attrs->affn_scope;
 
@@ -7206,16 +7244,26 @@ static ssize_t wq_affn_scope_show(struct device *dev,
 				  struct device_attribute *attr, char *buf)
 {
 	struct workqueue_struct *wq = dev_to_wq(dev);
+	enum wq_affn_scope scope, effective;
 	int written;
 
 	mutex_lock(&wq->mutex);
-	if (wq->unbound_attrs->affn_scope == WQ_AFFN_DFL)
-		written = scnprintf(buf, PAGE_SIZE, "%s (%s)\n",
-				    wq_affn_names[WQ_AFFN_DFL],
-				    wq_affn_names[wq_affn_dfl]);
-	else
+	if (wq->unbound_attrs->affn_scope == WQ_AFFN_DFL) {
+		scope = wq_affn_dfl;
+		effective = wq_effective_affn_scope(scope);
+		if (effective != scope)
+			written = scnprintf(buf, PAGE_SIZE, "%s (%s -> %s)\n",
+					    wq_affn_names[WQ_AFFN_DFL],
+					    wq_affn_names[scope],
+					    wq_affn_names[effective]);
+		else
+			written = scnprintf(buf, PAGE_SIZE, "%s (%s)\n",
+					    wq_affn_names[WQ_AFFN_DFL],
+					    wq_affn_names[scope]);
+	} else {
 		written = scnprintf(buf, PAGE_SIZE, "%s\n",
 				    wq_affn_names[wq->unbound_attrs->affn_scope]);
+	}
 	mutex_unlock(&wq->mutex);
 
 	return written;
-- 
2.52.0
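P.S. One way to confirm the fallback on an affected system is to read
the affinity_scope attribute of a WQ_SYSFS workqueue; "writeback" below
is only an example, and the exact output depends on the local topology:

    $ cat /sys/devices/virtual/workqueue/writeback/affinity_scope
    default (cache -> smt)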