From: Qiliang Yuan
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Qiliang Yuan, Qiliang Yuan, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider, linux-kernel@vger.kernel.org
Subject: [PATCH v4] sched/fair: Cache NUMA node statistics to avoid O(N) scanning
Date: Wed, 28 Jan 2026 22:19:43 -0500
Message-ID: <20260129031948.2384178-1-realwujing@gmail.com>

Optimize update_numa_stats() by reducing the algorithmic complexity of
node statistics aggregation from O(N_cpus_per_node) to O(1) for the
source node.

Currently, the NUMA balancing path performs an O(N) scan across all
CPUs in a node to aggregate load, utilization, and runnable statistics.
On large-scale systems, this repeated traversal of per-CPU runqueues
incurs significant overhead.

Introduce a per-node statistics cache in the root_domain and update it
during the load-balancing hot path (update_sg_lb_stats()). Leverage
these pre-aggregated stats to let update_numa_stats() bypass the O(N)
scan for the source node, achieving O(1) complexity. Maintain
consistency through a seqlock mechanism to prevent data tearing during
concurrent updates. This amortizes the cost of data collection and
significantly reduces scheduling overhead in the NUMA placement hot
path.

Signed-off-by: Qiliang Yuan
Signed-off-by: Qiliang Yuan
---
v4:
- Implement seqlock (seqcount_t + spinlock_t) to ensure multi-variable
  consistency and prevent tearing.
- Move cache to 'root_domain' with dynamic allocation to handle
  partitioning and save memory on non-NUMA systems.
- Add #ifdef CONFIG_NUMA guards and improve code robustness.

v3:
- Optimize update_numa_stats() to O(1) for the source node by caching
  stats during load balancing.
- Introduce a global cache (later refined in v4).

v2:
- Use cached stats from the first group in the hierarchy.

v1:
- Initial implementation of NUMA stats caching.

 kernel/sched/fair.c     | 64 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h    | 17 +++++++++++
 kernel/sched/topology.c | 20 +++++++++++++
 3 files changed, 101 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e71302282671..eafe1cfd7ff6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -48,6 +48,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include
@@ -2104,6 +2105,43 @@ static void update_numa_stats(struct task_numa_env *env,
 	ns->idle_cpu = -1;
 
 	rcu_read_lock();
+	/*
+	 * Algorithmic Optimization: Avoid O(N) scan by using cached stats.
+	 * Only applicable for the source node where we don't need to find
+	 * an idle CPU. This skip is safe because we only need the totals
+	 * for the source node's load, util, and nr_running to check for
+	 * relative imbalance.
+	 */
+	if (!find_idle && nid == env->src_nid) {
+		struct root_domain *rd = rcu_dereference(cpu_rq(env->src_cpu)->rd);
+		struct numa_stats_cache *cache;
+		unsigned int seq;
+
+		if (rd && rd->node_stats_cache) {
+			cache = &rd->node_stats_cache[nid];
+			do {
+				seq = read_seqcount_begin(&cache->seq);
+				if (!time_before(jiffies, cache->last_update + msecs_to_jiffies(10))) {
+					ns->compute_capacity = 0;
+					break;
+				}
+
+				ns->load = cache->load;
+				ns->runnable = cache->runnable;
+				ns->util = cache->util;
+				ns->nr_running = cache->nr_running;
+				ns->compute_capacity = cache->capacity;
+			} while (read_seqcount_retry(&cache->seq, seq));
+
+			if (ns->compute_capacity)
+				goto skip_scan;
+
+			/* Reset stats before O(N) scan if cache was stale or torn */
+			memset(ns, 0, sizeof(*ns));
+			ns->idle_cpu = -1;
+		}
+	}
+
 	for_each_cpu(cpu, cpumask_of_node(nid)) {
 		struct rq *rq = cpu_rq(cpu);
 
@@ -2124,6 +2162,8 @@ static void update_numa_stats(struct task_numa_env *env,
 			idle_core = numa_idle_core(idle_core, cpu);
 		}
 	}
+
+skip_scan:
 	rcu_read_unlock();
 
 	ns->weight = cpumask_weight(cpumask_of_node(nid));
@@ -10488,6 +10528,30 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	if (sgs->group_type == group_overloaded)
 		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
 				sgs->group_capacity;
+
+	/* Algorithmic Optimization: Cache node stats for O(1) NUMA lookups */
+#ifdef CONFIG_NUMA
+	if (env->sd->flags & SD_NUMA) {
+		int nid = cpu_to_node(cpumask_first(sched_group_span(group)));
+		struct root_domain *rd = env->dst_rq->rd;
+		struct numa_stats_cache *cache;
+
+		if (rd && rd->node_stats_cache) {
+			cache = &rd->node_stats_cache[nid];
+
+			spin_lock(&cache->lock);
+			write_seqcount_begin(&cache->seq);
+			cache->nr_running = sgs->sum_h_nr_running;
+			cache->load = sgs->group_load;
+			cache->util = sgs->group_util;
+			cache->runnable = sgs->group_runnable;
+			cache->capacity = sgs->group_capacity;
+			cache->last_update = jiffies;
+
+			write_seqcount_end(&cache->seq);
+			spin_unlock(&cache->lock);
+		}
+	}
+#endif
 }
 
 /**
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d30cca6870f5..643e6c743fc8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -974,6 +974,19 @@ struct perf_domain {
 	struct rcu_head rcu;
 };
 
+#ifdef CONFIG_NUMA
+struct numa_stats_cache {
+	spinlock_t lock;
+	seqcount_t seq;
+	unsigned long load;
+	unsigned long runnable;
+	unsigned long util;
+	unsigned long capacity;
+	unsigned int nr_running;
+	unsigned long last_update;
+} ____cacheline_aligned;
+#endif
+
 /*
  * We add the notion of a root-domain which will be used to define per-domain
  * variables. Each exclusive cpuset essentially defines an island domain by
@@ -1042,6 +1055,10 @@ struct root_domain {
 	 * CPUs of the rd. Protected by RCU.
 	 */
 	struct perf_domain __rcu *pd;
+
+#ifdef CONFIG_NUMA
+	struct numa_stats_cache *node_stats_cache;
+#endif
 };
 
 extern void init_defrootdomain(void);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index cf643a5ddedd..094a4f197af3 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -5,6 +5,7 @@
 
 #include
 #include
+#include
 #include "sched.h"
 
 DEFINE_MUTEX(sched_domains_mutex);
@@ -466,6 +467,9 @@ static void free_rootdomain(struct rcu_head *rcu)
 	free_cpumask_var(rd->online);
 	free_cpumask_var(rd->span);
 	free_pd(rd->pd);
+#ifdef CONFIG_NUMA
+	kfree(rd->node_stats_cache);
+#endif
 	kfree(rd);
 }
 
@@ -551,8 +555,24 @@ static int init_rootdomain(struct root_domain *rd)
 
 	if (cpupri_init(&rd->cpupri) != 0)
 		goto free_cpudl;
+
+#ifdef CONFIG_NUMA
+	rd->node_stats_cache = kcalloc(nr_node_ids, sizeof(struct numa_stats_cache),
+				       GFP_KERNEL);
+	if (!rd->node_stats_cache)
+		goto free_cpupri;
+
+	int i;
+	for (i = 0; i < nr_node_ids; i++) {
+		spin_lock_init(&rd->node_stats_cache[i].lock);
+		seqcount_init(&rd->node_stats_cache[i].seq);
+	}
+#endif
+
 	return 0;
 
+free_cpupri:
+	cpupri_cleanup(&rd->cpupri);
 free_cpudl:
 	cpudl_cleanup(&rd->cpudl);
 free_rto_mask:
-- 
2.51.0