From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, "Gautham R. Shenoy",
    Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
    Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton,
    Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
    Len Brown, Aubrey Li, Zhao Liu, Adam Li, Aaron Lu, Tim Chen,
    linux-kernel@vger.kernel.org
Subject: [PATCH v2 19/23] sched/cache: Avoid cache-aware scheduling for
 memory-heavy processes
Date: Wed, 3 Dec 2025 15:07:38 -0800

From: Chen Yu

Prateek and Tingyin reported that memory-intensive workloads (such as
stream) can saturate the memory bandwidth and caches of the preferred
LLC when sched_cache aggregates too many threads there.

To mitigate this, estimate a process's memory footprint by comparing
its RSS (anonymous and shared pages) to the size of the LLC. If the
RSS exceeds the LLC size, skip cache-aware scheduling. Note that RSS
is only an approximation of the memory footprint. By default the
comparison is strict, but a later patch will let users provide a hint
to adjust this threshold.

According to tests from Adam, some systems have no shared L3 but do
have a shared L2 within each cluster. In that case, the L2 becomes
the LLC[1].
Link[1]: https://lore.kernel.org/all/3cb6ebc7-a2fd-42b3-8739-b00e28a09cb6@os.amperecomputing.com/

Co-developed-by: Tim Chen
Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
Notes:
    v1->v2:
        Assigned curr_cpu in task_cache_work() before checking
        exceed_llc_capacity(mm, curr_cpu) to avoid an out-of-bounds
        access. (lkp/0day)

 include/linux/cacheinfo.h | 21 ++++++++++++++-------
 kernel/sched/fair.c       | 49 ++++++++++++++++++++++++++++++++++-----
 2 files changed, 57 insertions(+), 13 deletions(-)

diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index c8f4f0a0b874..82d0d59ca0e1 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -113,18 +113,11 @@ int acpi_get_cache_info(unsigned int cpu,
 
 const struct attribute_group *cache_get_priv_group(struct cacheinfo *this_leaf);
 
-/*
- * Get the cacheinfo structure for the cache associated with @cpu at
- * level @level.
- * cpuhp lock must be held.
- */
-static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
+static inline struct cacheinfo *_get_cpu_cacheinfo_level(int cpu, int level)
 {
 	struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
 	int i;
 
-	lockdep_assert_cpus_held();
-
 	for (i = 0; i < ci->num_leaves; i++) {
 		if (ci->info_list[i].level == level) {
 			if (ci->info_list[i].attributes & CACHE_ID)
@@ -136,6 +129,18 @@ static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
 	return NULL;
 }
 
+/*
+ * Get the cacheinfo structure for the cache associated with @cpu at
+ * level @level.
+ * cpuhp lock must be held.
+ */
+static inline struct cacheinfo *get_cpu_cacheinfo_level(int cpu, int level)
+{
+	lockdep_assert_cpus_held();
+
+	return _get_cpu_cacheinfo_level(cpu, level);
+}
+
 /*
  * Get the id of the cache associated with @cpu at level @level.
  * cpuhp lock must be held.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6afa3f9a4e9b..424ec601cfdf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1223,6 +1223,38 @@ static int llc_id(int cpu)
 	return llc;
 }
 
+static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
+{
+	struct cacheinfo *ci;
+	unsigned long rss;
+	unsigned int llc;
+
+	/*
+	 * get_cpu_cacheinfo_level() cannot be used
+	 * because it requires the cpu_hotplug_lock
+	 * to be held. Use _get_cpu_cacheinfo_level()
+	 * directly because the 'cpu' cannot be
+	 * offlined at the moment.
+	 */
+	ci = _get_cpu_cacheinfo_level(cpu, 3);
+	if (!ci) {
+		/*
+		 * On systems without an L3 but with a shared
+		 * L2, the L2 becomes the LLC.
+		 */
+		ci = _get_cpu_cacheinfo_level(cpu, 2);
+		if (!ci)
+			return true;
+	}
+
+	llc = ci->size;
+
+	rss = get_mm_counter(mm, MM_ANONPAGES) +
+	      get_mm_counter(mm, MM_SHMEMPAGES);
+
+	return (llc <= (rss * PAGE_SIZE));
+}
+
 static bool exceed_llc_nr(struct mm_struct *mm, int cpu)
 {
 	int smt_nr = 1;
@@ -1382,7 +1414,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 */
 	if (epoch - READ_ONCE(mm->mm_sched_epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
 	    get_nr_threads(p) <= 1 ||
-	    exceed_llc_nr(mm, cpu_of(rq))) {
+	    exceed_llc_nr(mm, cpu_of(rq)) ||
+	    exceed_llc_capacity(mm, cpu_of(rq))) {
 		if (mm->mm_sched_cpu != -1)
 			mm->mm_sched_cpu = -1;
 	}
@@ -1439,7 +1472,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
 	struct mm_struct *mm = p->mm;
 	unsigned long m_a_occ = 0;
 	unsigned long curr_m_a_occ = 0;
-	int cpu, m_a_cpu = -1, nr_running = 0;
+	int cpu, m_a_cpu = -1, nr_running = 0, curr_cpu;
 	cpumask_var_t cpus;
 
 	WARN_ON_ONCE(work != &p->cache_work);
@@ -1449,7 +1482,9 @@ static void __no_profile task_cache_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
-	if (get_nr_threads(p) <= 1) {
+	curr_cpu = task_cpu(p);
+	if (get_nr_threads(p) <= 1 ||
+	    exceed_llc_capacity(mm, curr_cpu)) {
 		if (mm->mm_sched_cpu != -1)
 			mm->mm_sched_cpu = -1;
 
@@ -9895,8 +9930,12 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
 	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
 		return mig_unrestricted;
 
-	/* skip cache aware load balance for single/too many threads */
-	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu))
+	/*
+	 * Skip cache aware load balance for single/too many threads
+	 * or large footprint.
+	 */
+	if (get_nr_threads(p) <= 1 || exceed_llc_nr(mm, dst_cpu) ||
+	    exceed_llc_capacity(mm, dst_cpu))
 		return mig_unrestricted;
 
 	if (cpus_share_cache(dst_cpu, cpu))
-- 
2.32.0