From: Qiliang Yuan
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Qiliang Yuan, Qiliang Yuan, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel@vger.kernel.org
Subject: [PATCH v4] sched/fair: Cache NUMA node statistics to avoid O(N) scanning
Date: Wed, 28 Jan 2026 22:19:43 -0500
Message-ID: <20260129031948.2384178-1-realwujing@gmail.com>

Optimize update_numa_stats() by reducing the algorithmic complexity of
node statistics aggregation from O(N_cpus_per_node) to O(1) for the
source node.

Currently, the NUMA balancing path performs an O(N) scan across all
CPUs in a node to aggregate load, utilization, and runnable statistics.
On large-scale systems, this repeated traversal of per-CPU runqueues
incurs significant overhead.

Introduce a per-node statistics cache in the root_domain and update it
during the load-balancing hot path (update_sg_lb_stats()). Leverage
these pre-aggregated stats to let update_numa_stats() bypass the O(N)
scan for the source node, achieving O(1) complexity. Maintain
consistency through a seqlock mechanism to prevent data tearing during
concurrent updates. This amortizes the cost of data collection and
significantly reduces scheduling overhead in the NUMA placement hot
path.

Signed-off-by: Qiliang Yuan
Signed-off-by: Qiliang Yuan
---
v4:
- Implement seqlock (seqcount_t + spinlock_t) to ensure multi-variable
  consistency and prevent tearing.
- Move cache to 'root_domain' with dynamic allocation to handle
  partitioning and save memory on non-NUMA systems.
- Add #ifdef CONFIG_NUMA guards and improve code robustness.

v3:
- Optimize update_numa_stats() to O(1) for the source node by caching
  stats during load balancing.
- Introduce a global cache (later refined in v4).

v2:
- Use cached stats from the first group in the hierarchy.

v1:
- Initial implementation of NUMA stats caching.

 kernel/sched/fair.c     | 64 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h    | 17 +++++++++++
 kernel/sched/topology.c | 20 +++++++++++++
 3 files changed, 101 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e71302282671..eafe1cfd7ff6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -48,6 +48,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include
@@ -2104,6 +2105,43 @@ static void update_numa_stats(struct task_numa_env *env,
 	ns->idle_cpu = -1;
 
 	rcu_read_lock();
+	/*
+	 * Algorithmic Optimization: Avoid O(N) scan by using cached stats.
+	 * Only applicable for the source node where we don't need to find
+	 * an idle CPU. This skip is safe because we only need the totals
+	 * for the source node's load, util, and nr_running to check for
+	 * relative imbalance.
+	 */
+	if (!find_idle && nid == env->src_nid) {
+		struct root_domain *rd = rcu_dereference(cpu_rq(env->src_cpu)->rd);
+		struct numa_stats_cache *cache;
+		unsigned int seq;
+
+		if (rd && rd->node_stats_cache) {
+			cache = &rd->node_stats_cache[nid];
+			do {
+				seq = read_seqcount_begin(&cache->seq);
+				if (!time_before(jiffies, cache->last_update + msecs_to_jiffies(10))) {
+					ns->compute_capacity = 0;
+					break;
+				}
+
+				ns->load = cache->load;
+				ns->runnable = cache->runnable;
+				ns->util = cache->util;
+				ns->nr_running = cache->nr_running;
+				ns->compute_capacity = cache->capacity;
+			} while (read_seqcount_retry(&cache->seq, seq));
+
+			if (ns->compute_capacity)
+				goto skip_scan;
+
+			/* Reset stats before O(N) scan if cache was stale or torn */
+			memset(ns, 0, sizeof(*ns));
+			ns->idle_cpu = -1;
+		}
+	}
+
 	for_each_cpu(cpu, cpumask_of_node(nid)) {
 		struct rq *rq = cpu_rq(cpu);
 
@@ -2124,6 +2162,8 @@ static void update_numa_stats(struct task_numa_env *env,
 			idle_core = numa_idle_core(idle_core, cpu);
 		}
 	}
+
+skip_scan:
 	rcu_read_unlock();
 
 	ns->weight = cpumask_weight(cpumask_of_node(nid));
@@ -10488,6 +10528,30 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	if (sgs->group_type == group_overloaded)
 		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
 				sgs->group_capacity;
+
+	/* Algorithmic Optimization: Cache node stats for O(1) NUMA lookups */
+#ifdef CONFIG_NUMA
+	if (env->sd->flags & SD_NUMA) {
+		int nid = cpu_to_node(cpumask_first(sched_group_span(group)));
+		struct root_domain *rd = env->dst_rq->rd;
+		struct numa_stats_cache *cache;
+
+		if (rd && rd->node_stats_cache) {
+			cache = &rd->node_stats_cache[nid];
+
+			spin_lock(&cache->lock);
+			write_seqcount_begin(&cache->seq);
+			cache->nr_running = sgs->sum_h_nr_running;
+			cache->load = sgs->group_load;
+			cache->util = sgs->group_util;
+			cache->runnable = sgs->group_runnable;
+			cache->capacity = sgs->group_capacity;
+			cache->last_update = jiffies;
+
+			write_seqcount_end(&cache->seq);
+			spin_unlock(&cache->lock);
+		}
+	}
+#endif
 }
 
 /**
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d30cca6870f5..643e6c743fc8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -974,6 +974,19 @@ struct perf_domain {
 	struct rcu_head rcu;
 };
 
+#ifdef CONFIG_NUMA
+struct numa_stats_cache {
+	spinlock_t lock;
+	seqcount_t seq;
+	unsigned long load;
+	unsigned long runnable;
+	unsigned long util;
+	unsigned long capacity;
+	unsigned int nr_running;
+	unsigned long last_update;
+} ____cacheline_aligned;
+#endif
+
 /*
  * We add the notion of a root-domain which will be used to define per-domain
  * variables. Each exclusive cpuset essentially defines an island domain by
@@ -1042,6 +1055,10 @@ struct root_domain {
 	 * CPUs of the rd. Protected by RCU.
 	 */
 	struct perf_domain __rcu *pd;
+
+#ifdef CONFIG_NUMA
+	struct numa_stats_cache *node_stats_cache;
+#endif
 };
 
 extern void init_defrootdomain(void);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index cf643a5ddedd..094a4f197af3 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -5,6 +5,7 @@
 
 #include
 #include
+#include
 #include "sched.h"
 
 DEFINE_MUTEX(sched_domains_mutex);
@@ -466,6 +467,9 @@ static void free_rootdomain(struct rcu_head *rcu)
 	free_cpumask_var(rd->online);
 	free_cpumask_var(rd->span);
 	free_pd(rd->pd);
+#ifdef CONFIG_NUMA
+	kfree(rd->node_stats_cache);
+#endif
 	kfree(rd);
 }
 
@@ -551,8 +555,24 @@ static int init_rootdomain(struct root_domain *rd)
 
 	if (cpupri_init(&rd->cpupri) != 0)
 		goto free_cpudl;
+
+#ifdef CONFIG_NUMA
+	rd->node_stats_cache = kcalloc(nr_node_ids, sizeof(struct numa_stats_cache),
+				       GFP_KERNEL);
+	if (!rd->node_stats_cache)
+		goto free_cpupri;
+
+	int i;
+	for (i = 0; i < nr_node_ids; i++) {
+		spin_lock_init(&rd->node_stats_cache[i].lock);
+		seqcount_init(&rd->node_stats_cache[i].seq);
+	}
+#endif
+
 	return 0;
 
+free_cpupri:
+	cpupri_cleanup(&rd->cpupri);
 free_cpudl:
 	cpudl_cleanup(&rd->cpudl);
 free_rto_mask:
-- 
2.51.0