From nobody Fri Jun 12 15:46:56 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Vincent Guittot
Cc: Jianyong Wu, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
 Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
 Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan,
 Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li, Zhao Liu, Chen Yu,
 Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef,
 Libo Chen, Luo Gengkun, linux-kernel@vger.kernel.org
Subject: [Patch v4 01/16] sched/cache: Allow only 1 thread of the process to calculate the LLC occupancy
Date: Wed, 13 May 2026 13:39:12 -0700
Message-Id: <5672b52e588b855b01e5a1a17822f7c6c7237a3d.1778703694.git.tim.c.chen@linux.intel.com>
X-Mailer: git-send-email 2.32.0
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Jianyong Wu

Scanning online CPUs to calculate the occupancy might be time-consuming.
Only allow 1 thread of the process to scan the CPUs at the same time,
which is similar to what NUMA balancing does in task_numa_work().

Signed-off-by: Jianyong Wu
Signed-off-by: Chen Yu
Signed-off-by: Tim Chen
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2010483cd77..6d883f109ba3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2423,6 +2423,7 @@ struct sched_cache_stat {
 	struct sched_cache_time __percpu *pcpu_sched;
 	raw_spinlock_t lock;
 	unsigned long epoch;
+	unsigned long next_scan;
 	int cpu;
 } ____cacheline_aligned_in_smp;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5f22e5a097cf..a759ea669d74 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1451,6 +1451,7 @@ void mm_init_sched(struct mm_struct *mm,
 	raw_spin_lock_init(&mm->sc_stat.lock);
 	mm->sc_stat.epoch = epoch;
 	mm->sc_stat.cpu = -1;
+	mm->sc_stat.next_scan = jiffies;
 
 	/*
 	 * The update to mm->sc_stat should not be reordered
@@ -1661,6 +1662,7 @@ static void get_scan_cpumasks(cpumask_var_t cpus, struct task_struct *p)
 
 static void task_cache_work(struct callback_head *work)
 {
+	unsigned long next_scan, now = jiffies;
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
 	unsigned long m_a_occ = 0;
@@ -1675,6 +1677,15 @@ static void task_cache_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
+	next_scan = READ_ONCE(mm->sc_stat.next_scan);
+	if (time_before(now, next_scan))
+		return;
+
+	/* only 1 thread is allowed to scan */
+	if (!try_cmpxchg(&mm->sc_stat.next_scan, &next_scan,
+			 now + EPOCH_PERIOD))
+		return;
+
 	if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
 		return;
 
-- 
2.32.0

From nobody Fri Jun 12 15:46:56 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
 Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton,
 Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
 Vern Hao, Len Brown, Tim Chen, Aubrey Li, Zhao Liu, Chen Yu, Adam Li,
 Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef, Libo Chen,
 Luo Gengkun, linux-kernel@vger.kernel.org
Subject: [Patch v4 02/16] sched/cache: Disable cache aware scheduling for processes with high thread counts
Date: Wed, 13 May 2026 13:39:13 -0700
X-Mailer: git-send-email 2.32.0
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Chen Yu

A performance regression was observed by Prateek when running
hackbench with many threads per process (high fd count). To avoid
this, processes with a large number of active threads are excluded
from cache-aware scheduling.

With sched_cache enabled, record the number of active threads in each
process during the periodic task_cache_work(). While iterating over
CPUs, if the currently running task belongs to the same process as
the task that launched task_cache_work(), increment the active thread
count.

If the number of active threads within the process exceeds the number
of cores in the LLC (the LLC's CPU count divided by the SMT number),
do not enable cache-aware scheduling.

However, on systems with a small number of CPUs per LLC, like
Power10/Power11 with SMT4 and an LLC size of 4, this check
effectively disables cache-aware scheduling for any process. One
possible solution suggested by Peter is to use an LLC-mask instead of
a single LLC value for preference. Once a 'few' LLCs can be preferred,
this constraint becomes a little easier to satisfy. It could be an
enhancement in the future.

For users who wish to perform task aggregation regardless, a debugfs
knob is provided for tuning in a subsequent change.
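
The thread-count check and the averaging described above can be modelled in user space with the sketch below. It is an illustration under stated assumptions, not the kernel code: fits_capacity_model() assumes the kernel's fits_capacity() rule is "cap * 1280 < max * 1024" (about 20% headroom), and too_many_threads(), update_avg_scale_model(), llc_size and smt_threads are hypothetical stand-ins for invalid_llc_nr(), update_avg_scale(), sd_llc_size and cpu_smt_num_threads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror fits_capacity(): cap must stay ~20% below max. */
static bool fits_capacity_model(uint64_t cap, uint64_t max)
{
	return cap * 1280 < max * 1024;
}

/*
 * Hypothetical model of invalid_llc_nr(): the averaged number of running
 * threads, scaled by the SMT width, must fit into the LLC's CPU count.
 */
static bool too_many_threads(uint64_t nr_running_avg,
			     unsigned int smt_threads,
			     unsigned int llc_size)
{
	assert(llc_size > 0);
	return !fits_capacity_model(nr_running_avg * smt_threads, llc_size);
}

/*
 * Hypothetical model of update_avg_scale(): a moving average whose
 * divisor is llc_size / 4, clamped to the range [2, 8].
 */
static void update_avg_scale_model(uint64_t *avg, uint64_t sample,
				   unsigned int llc_size)
{
	int64_t diff = (int64_t)sample - (int64_t)*avg;
	int64_t divisor = llc_size >> 2;

	if (divisor < 2)
		divisor = 2;
	if (divisor > 8)
		divisor = 8;

	*avg = (uint64_t)((int64_t)*avg + diff / divisor);
}
```

Under these assumptions, too_many_threads(1, 4, 4) is true: with SMT4 and an LLC of 4 CPUs, an average of one running thread gives 4 * 1280 = 5120, which is not below 4 * 1024 = 4096. That is the Power10/Power11 case the commit message describes.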
Tested-by: Tingyin Duan
Suggested-by: K Prateek Nayak
Suggested-by: Aaron Lu
Signed-off-by: Chen Yu
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 48 ++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d883f109ba3..6701911eaaf7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2423,6 +2423,7 @@ struct sched_cache_stat {
 	struct sched_cache_time __percpu *pcpu_sched;
 	raw_spinlock_t lock;
 	unsigned long epoch;
+	u64 nr_running_avg;
 	unsigned long next_scan;
 	int cpu;
 } ____cacheline_aligned_in_smp;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a759ea669d74..808f614fc2d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1384,6 +1384,12 @@ static int llc_id(int cpu)
 	return per_cpu(sd_llc_id, cpu);
 }
 
+static bool invalid_llc_nr(struct mm_struct *mm, int cpu)
+{
+	return !fits_capacity((mm->sc_stat.nr_running_avg * cpu_smt_num_threads),
+			      per_cpu(sd_llc_size, cpu));
+}
+
 static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
 {
 	struct sched_domain *sd;
@@ -1452,7 +1458,7 @@ void mm_init_sched(struct mm_struct *mm,
 	mm->sc_stat.epoch = epoch;
 	mm->sc_stat.cpu = -1;
 	mm->sc_stat.next_scan = jiffies;
-
+	mm->sc_stat.nr_running_avg = 0;
 	/*
 	 * The update to mm->sc_stat should not be reordered
 	 * before initialization to mm's other fields, in case
@@ -1574,7 +1580,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 * If this process hasn't hit task_cache_work() for a while invalidate
 	 * its preferred state.
 	 */
-	if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT) {
+	if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
+	    invalid_llc_nr(mm, cpu_of(rq))) {
 		if (mm->sc_stat.cpu != -1)
 			mm->sc_stat.cpu = -1;
 	}
@@ -1660,14 +1667,32 @@ static void get_scan_cpumasks(cpumask_var_t cpus, struct task_struct *p)
 	cpumask_copy(cpus, cpu_online_mask);
 }
 
+static inline void update_avg_scale(u64 *avg, u64 sample)
+{
+	int factor = per_cpu(sd_llc_size, raw_smp_processor_id());
+	s64 diff = sample - *avg;
+	u32 divisor;
+
+	/*
+	 * Scale the divisor based on the number of CPUs contained
+	 * in the LLC. This scaling ensures smaller LLC domains use
+	 * a smaller divisor to achieve more precise sensitivity to
+	 * changes in nr_running, while larger LLC domains are capped
+	 * at a maximum divisor of 8 which is the default smoothing
+	 * factor of EWMA in update_avg().
+	 */
+	divisor = clamp_t(u32, (factor >> 2), 2, 8);
+	*avg += div64_s64(diff, divisor);
+}
+
 static void task_cache_work(struct callback_head *work)
 {
 	unsigned long next_scan, now = jiffies;
-	struct task_struct *p = current;
+	struct task_struct *p = current, *cur;
+	int cpu, m_a_cpu = -1, nr_running = 0;
+	unsigned long curr_m_a_occ = 0;
 	struct mm_struct *mm = p->mm;
 	unsigned long m_a_occ = 0;
-	unsigned long curr_m_a_occ = 0;
-	int cpu, m_a_cpu = -1;
 	cpumask_var_t cpus;
 
 	WARN_ON_ONCE(work != &p->cache_work);
@@ -1711,6 +1736,11 @@ static void task_cache_work(struct callback_head *work)
 					m_occ = occ;
 					m_cpu = i;
 				}
+
+				cur = rcu_dereference_all(cpu_rq(i)->curr);
+				if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
+				    cur->mm == mm)
+					nr_running++;
 			}
 
 			/*
@@ -1754,6 +1784,7 @@ static void task_cache_work(struct callback_head *work)
 		mm->sc_stat.cpu = m_a_cpu;
 	}
 
+	update_avg_scale(&mm->sc_stat.nr_running_avg, nr_running);
 	free_cpumask_var(cpus);
 }
 
@@ -10294,6 +10325,13 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
 	if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
 		return mig_unrestricted;
 
+	/* skip cache aware load balance for too many threads */
+	if (invalid_llc_nr(mm, dst_cpu)) {
+		if (mm->sc_stat.cpu != -1)
+			mm->sc_stat.cpu = -1;
+		return mig_unrestricted;
+	}
+
 	if (cpus_share_cache(dst_cpu, cpu))
 		to_pref = true;
 	else if (cpus_share_cache(src_cpu, cpu))
-- 
2.32.0

From nobody Fri Jun 12 15:46:56 2026
From: Tim Chen
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
 Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy, Hillf Danton,
 Shrikanth Hegde, Jianyong Wu, Yangyu Chen, Tingyin Duan, Vern Hao,
 Vern Hao, Len Brown, Tim Chen, Aubrey Li, Zhao Liu, Chen Yu, Adam Li,
 Aaron Lu, Tim Chen, Josh Don, Gavin Guo, Qais Yousef, Libo Chen,
 Luo Gengkun, linux-kernel@vger.kernel.org
Subject: [Patch v4 03/16] sched/cache: Skip cache-aware scheduling for single-threaded processes
Date: Wed, 13 May 2026 13:39:14 -0700
Message-Id: <8a59a13aa58fdb48e410ecb2aabd97fe3ea5d256.1778703694.git.tim.c.chen@linux.intel.com>
X-Mailer: git-send-email 2.32.0
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Chen Yu

For a single thread, the current wakeup path tends to place it on the
same LLC where it was previously running with cache-hot data. There is
no need to enable cache-aware scheduling for single-threaded processes
for the following reasons:

1. Cache-aware scheduling primarily benefits multi-threaded processes
   where threads share data. Single-threaded processes typically have
   no inter-thread data sharing and thus gain little.

2. Enabling it incurs the additional overhead of tracking the thread's
   residency in the LLCs.

3. Bypassing single-threaded processes avoids excessive concentration
   of such tasks on a single LLC.

Nevertheless, this check can be omitted if users explicitly provide
hints for such single-threaded workloads where different processes
have shared memory, e.g., via prctl() or other interfaces to be added
in the future.
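
A minimal user-space sketch of the resulting gate, assuming the single-thread test runs before the thread-count check. struct proc_model and skip_cache_aware() are hypothetical names used only for illustration, and the capacity rule "cap * 1280 < max * 1024" is an assumption about the kernel's fits_capacity() macro.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-process state the check consults. */
struct proc_model {
	int nr_threads;          /* what get_nr_threads(p) would report */
	uint64_t nr_running_avg; /* averaged count of running threads */
};

/* Returns true when cache-aware scheduling should be skipped. */
static bool skip_cache_aware(const struct proc_model *pm,
			     unsigned int smt_threads,
			     unsigned int llc_size)
{
	/* Single-threaded: no inter-thread sharing to exploit. */
	if (pm->nr_threads <= 1)
		return true;

	/* Otherwise apply the thread-count check (assumed capacity rule). */
	return !(pm->nr_running_avg * smt_threads * 1280 <
		 (uint64_t)llc_size * 1024);
}
```

Checking the thread count first keeps the cheap test ahead of the arithmetic, and a single-threaded process is rejected regardless of its running average.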
Tested-by: Tingyin Duan Signed-off-by: Chen Yu Co-developed-by: Tim Chen Signed-off-by: Tim Chen --- kernel/sched/fair.c | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 808f614fc2d2..df21366ba1ca 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1384,8 +1384,12 @@ static int llc_id(int cpu) return per_cpu(sd_llc_id, cpu); } =20 -static bool invalid_llc_nr(struct mm_struct *mm, int cpu) +static bool invalid_llc_nr(struct mm_struct *mm, struct task_struct *p, + int cpu) { + if (get_nr_threads(p) <=3D 1) + return true; + return !fits_capacity((mm->sc_stat.nr_running_avg * cpu_smt_num_threads), per_cpu(sd_llc_size, cpu)); } @@ -1581,7 +1585,7 @@ void account_mm_sched(struct rq *rq, struct task_stru= ct *p, s64 delta_exec) * its preferred state. */ if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT || - invalid_llc_nr(mm, cpu_of(rq))) { + invalid_llc_nr(mm, p, cpu_of(rq))) { if (mm->sc_stat.cpu !=3D -1) mm->sc_stat.cpu =3D -1; } @@ -1687,9 +1691,9 @@ static inline void update_avg_scale(u64 *avg, u64 sam= ple) =20 static void task_cache_work(struct callback_head *work) { + int cpu, m_a_cpu =3D -1, nr_running =3D 0, curr_cpu; unsigned long next_scan, now =3D jiffies; struct task_struct *p =3D current, *cur; - int cpu, m_a_cpu =3D -1, nr_running =3D 0; unsigned long curr_m_a_occ =3D 0; struct mm_struct *mm =3D p->mm; unsigned long m_a_occ =3D 0; @@ -1711,6 +1715,14 @@ static void task_cache_work(struct callback_head *wo= rk) now + EPOCH_PERIOD)) return; =20 + curr_cpu =3D task_cpu(p); + if (invalid_llc_nr(mm, p, curr_cpu)) { + if (mm->sc_stat.cpu !=3D -1) + mm->sc_stat.cpu =3D -1; + + return; + } + if (!zalloc_cpumask_var(&cpus, GFP_KERNEL)) return; =20 @@ -10326,7 +10338,7 @@ static enum llc_mig can_migrate_llc_task(int src_cp= u, int dst_cpu, return mig_unrestricted; =20 /* skip cache aware load balance for too many threads */ - if (invalid_llc_nr(mm, 
dst_cpu)) { + if (invalid_llc_nr(mm, p, dst_cpu)) { if (mm->sc_stat.cpu !=3D -1) mm->sc_stat.cpu =3D -1; return mig_unrestricted; --=20 2.32.0 From nobody Fri Jun 12 15:46:56 2026 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CA929384CD7 for ; Wed, 13 May 2026 20:33:36 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.17 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704418; cv=none; b=XS4hFIWAtoKlvfHUg5ngzaDvgGMPuRPhvTziFxsKy3CSSaRAm2RGEbumT85oKEx+GloK3ZC01NdakQJNmLa0UQ4ZwshbJDGJPYewwv0Uih9gvVX+FDk/1bfkRff/1qWZzbG7gXOWNo/Iq5+cS3nle02nEdXxOT0YFa8965dR2+c= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704418; c=relaxed/simple; bh=E0nEgU8NHJbSvpI+37J42hPBxHwU7Sw0zA2+0Rlp2Nw=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=Ra0mAU5b1O7GlfQ4DjOi2pRS60BJn9T2NDJePcraX9CAVb2I+owMqUS1fzNtT/6KIx6bmFKPY04OiROLFs4ptWq4/aHe4H6eKs3YzKVDEPBThG/yzQmSbNlfHlkU3YGt/GxfoBf4FUCHgUVcoyR+CzeRRLgwwClyMKcGDUeLNmg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=hAsd8azL; arc=none smtp.client-ip=198.175.65.17 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="hAsd8azL" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1778704417; x=1810240417; 
h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=E0nEgU8NHJbSvpI+37J42hPBxHwU7Sw0zA2+0Rlp2Nw=; b=hAsd8azLyLnLG/XbuZB8UKl5HyYm4bFgjsvG3q3AXaoSEQp6OWl5B7E7 d9k4lEQFVfnl3mX4YO05lVmgutwlHbSE8dL4kgxB0oR+gFgnH9APGwHzi oNXic/z9WdRfq17a+gPWZZ6EEVkNh77MhsGGnSf/HD8rjtTN1o5EV+d9G GGRrVM2tleg7BjNcwiUSsxsNQYCqExVuqEjNm/Xx6Irpd0wMSipNp/JSX v5NDZhVzpQFP3GFOXu4ZepFlodeKOgZTg6fMrQ1FbLvj0q3rJmjRUKshe g2Sa1K/pzg7zSQ766mht2q4E3DWuNeNxxXud9VHkpcADPc1Wj/lsdWpvG w==; X-CSE-ConnectionGUID: kzBcBWY3TS6x7e3cKIeOxQ== X-CSE-MsgGUID: H4endoy3SCK3DsW6x+jYTw== X-IronPort-AV: E=McAfee;i="6800,10657,11785"; a="79623037" X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="79623037" Received: from orviesa008.jf.intel.com ([10.64.159.148]) by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 May 2026 13:33:36 -0700 X-CSE-ConnectionGUID: b3WkzRXdTkiqPyZP/HGnog== X-CSE-MsgGUID: rRXopY6dTjKDQkwp+cNsNg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="238076338" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by orviesa008.jf.intel.com with ESMTP; 13 May 2026 13:33:35 -0700 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , Luo Gengkun , linux-kernel@vger.kernel.org Subject: [Patch v4 04/16] sched/cache: Calculate the LLC size and store it in sched_domain Date: Wed, 13 May 2026 13:39:15 -0700 Message-Id: <37afee09ff608034da0ce149e72d33b6f4698edf.1778703694.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: 
linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu Cache aware scheduling needs to know the LLC size that a process can use, so as to avoid memory-intensive tasks from being over-aggregated on a single LLC. Introduce a preparation patch to add get_effective_llc_bytes() to get the LLC size that a CPU can use. The function can be further enhanced by subtracting the LLC cache ways reserved by resctrl (CAT in Intel RDT, etc). Tested-by: Tingyin Duan Suggested-by: Peter Zijlstra (Intel) Signed-off-by: Chen Yu Co-developed-by: Tim Chen Signed-off-by: Tim Chen --- drivers/base/cacheinfo.c | 23 ++++++++ include/linux/cacheinfo.h | 1 + include/linux/sched/topology.h | 7 +++ kernel/sched/topology.c | 98 ++++++++++++++++++++++++++++++++-- 4 files changed, 126 insertions(+), 3 deletions(-) diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c index 391ac5e3d2f5..70701d3bc81c 100644 --- a/drivers/base/cacheinfo.c +++ b/drivers/base/cacheinfo.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -68,6 +69,24 @@ bool last_level_cache_is_valid(unsigned int cpu) =20 } =20 +/* + * Get the cacheinfo of the LLC associated with @cpu. + * Derived from update_per_cpu_data_slice_size_cpu(). 
+ */ +struct cacheinfo *get_cpu_cacheinfo_llc(unsigned int cpu) +{ + struct cacheinfo *llc; + + if (!last_level_cache_is_valid(cpu)) + return NULL; + + llc =3D per_cpu_cacheinfo_idx(cpu, cache_leaves(cpu) - 1); + if (llc->type !=3D CACHE_TYPE_DATA && llc->type !=3D CACHE_TYPE_UNIFIED) + return NULL; + + return llc; +} + bool last_level_cache_is_shared(unsigned int cpu_x, unsigned int cpu_y) { struct cacheinfo *llc_x, *llc_y; @@ -1018,6 +1037,7 @@ static int cacheinfo_cpu_online(unsigned int cpu) goto err; if (cpu_map_shared_cache(true, cpu, &cpu_map)) update_per_cpu_data_slice_size(true, cpu, cpu_map); + sched_update_llc_bytes(cpu); return 0; err: free_cache_attributes(cpu); @@ -1036,6 +1056,9 @@ static int cacheinfo_cpu_pre_down(unsigned int cpu) free_cache_attributes(cpu); if (nr_shared > 1) update_per_cpu_data_slice_size(false, cpu, cpu_map); + + sched_update_llc_bytes(cpu); + return 0; } =20 diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h index c8f4f0a0b874..fc879ac4cc4f 100644 --- a/include/linux/cacheinfo.h +++ b/include/linux/cacheinfo.h @@ -89,6 +89,7 @@ int populate_cache_leaves(unsigned int cpu); int cache_setup_acpi(unsigned int cpu); bool last_level_cache_is_valid(unsigned int cpu); bool last_level_cache_is_shared(unsigned int cpu_x, unsigned int cpu_y); +struct cacheinfo *get_cpu_cacheinfo_llc(unsigned int cpu); int fetch_cache_info(unsigned int cpu); int detect_cache_attributes(unsigned int cpu); #ifndef CONFIG_ACPI_PPTT diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h index 0036d6b4bd67..fe09d3268bc9 100644 --- a/include/linux/sched/topology.h +++ b/include/linux/sched/topology.h @@ -106,6 +106,7 @@ struct sched_domain { #ifdef CONFIG_SCHED_CACHE unsigned int llc_max; unsigned int *llc_counts __counted_by_ptr(llc_max); + unsigned long llc_bytes; #endif =20 #ifdef CONFIG_SCHEDSTATS @@ -265,4 +266,10 @@ static inline int task_node(const struct task_struct *= p) return cpu_to_node(task_cpu(p)); } =20 
+#ifdef CONFIG_SCHED_CACHE +extern void sched_update_llc_bytes(unsigned int cpu); +#else +static inline void sched_update_llc_bytes(unsigned int cpu) { } +#endif + #endif /* _LINUX_SCHED_TOPOLOGY_H */ diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 9fc99346ef4f..7248a7279abe 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -776,9 +776,11 @@ cpu_attach_domain(struct sched_domain *sd, struct root= _domain *rd, int cpu) /* move buffer to parent as child is being destroyed */ sd->llc_counts =3D tmp->llc_counts; sd->llc_max =3D tmp->llc_max; + sd->llc_bytes =3D tmp->llc_bytes; /* make sure destroy_sched_domain() does not free it */ tmp->llc_counts =3D NULL; tmp->llc_max =3D 0; + tmp->llc_bytes =3D 0; #endif /* * sched groups hold the flags of the child sched @@ -831,10 +833,42 @@ DEFINE_STATIC_KEY_FALSE(sched_cache_active); /* user wants cache aware scheduling [0 or 1] */ int sysctl_sched_cache_user =3D 1; =20 +/* + * Get the effective LLC size in bytes that @cpu's bottom sched_domain + * can use. A CPU within a cpuset partition can only use a proportion + * of the physical LLC, scaled by the ratio of the partition's span + * weight to the hardware LLC sharing weight. @sd should be the + * topmost domain with SD_SHARE_LLC. + * + * Returns 0 if cacheinfo is not yet populated. This happens during + * early boot when build_sched_domains() runs before the generic + * cacheinfo framework has been initialized (cacheinfo_cpu_online() + * is a device_initcall cpuhp callback). In that case, + * cacheinfo_cpu_online() will later call sched_update_llc_bytes() + * to fill in the bottom domain's llc_bytes once the cache attributes + * are available. 
+ */ +static unsigned long get_effective_llc_bytes(int cpu, + struct sched_domain *sd) +{ + struct cacheinfo *ci; + unsigned int hw_weight; + + ci =3D get_cpu_cacheinfo_llc(cpu); + if (!ci) + return 0; + + hw_weight =3D cpumask_weight(&ci->shared_cpu_map); + if (!hw_weight) + return 0; + + return div_u64((u64)ci->size * sd->span_weight, hw_weight); +} + static bool alloc_sd_llc(const struct cpumask *cpu_map, struct s_data *d) { - struct sched_domain *sd; + struct sched_domain *sd, *top_llc, *parent; unsigned int *p; int i; =20 @@ -848,8 +882,24 @@ static bool alloc_sd_llc(const struct cpumask *cpu_map, if (!p) goto err; =20 - sd->llc_max =3D max_lid + 1; - sd->llc_counts =3D p; + top_llc =3D sd; + /* + * Find the topmost SD_SHARE_LLC domain. + * Not yet attached to the CPU, so per_cpu(sd_llc, i) + * can not be used. + */ + while ((parent =3D rcu_dereference_protected(top_llc->parent, true)) && + (parent->flags & SD_SHARE_LLC)) + top_llc =3D parent; + + if (top_llc->flags & SD_SHARE_LLC) { + sd->llc_max =3D max_lid + 1; + sd->llc_counts =3D p; + sd->llc_bytes =3D get_effective_llc_bytes(i, top_llc); + } else { + /* avoid memory leak */ + kfree(p); + } } =20 return true; @@ -860,6 +910,7 @@ static bool alloc_sd_llc(const struct cpumask *cpu_map, kfree(sd->llc_counts); sd->llc_counts =3D NULL; sd->llc_max =3D 0; + sd->llc_bytes =3D 0; } } =20 @@ -919,6 +970,47 @@ void sched_cache_active_set_unlocked(void) { return sched_cache_active_set(false); } + +/* + * Update the bottom sched_domain's llc_bytes for @cpu and all its + * LLC siblings. Called from cacheinfo_cpu_online() or + * cacheinfo_cpu_pre_down() with cpu hotplug lock held. + * + * Note: get_effective_llc_bytes() returns 0 on PowerPC. + * thus cache aware scheduling is disabled on PowerPC for + * now. 
PowerPC does not use the generic cacheinfo framework -- + * it has its own cacheinfo with a separate struct cache hierarchy + * and does not populate the per-CPU struct cpu_cacheinfo array + * that get_cpu_cacheinfo_llc() reads. + */ +void sched_update_llc_bytes(unsigned int cpu) +{ + struct sched_domain *sd, *sdp; + unsigned int i; + + sched_domains_mutex_lock(); + + sdp =3D rcu_dereference_sched_domain(per_cpu(sd_llc, cpu)); + if (!sdp) + goto unlock; + + /* + * ci->shared_cpu_map is built incrementally as CPUs come + * online, so the first CPU in an LLC initially sees + * hw_weight =3D=3D 1 and computes an inflated llc_bytes in + * get_effective_llc_bytes(). Re-evaluating every LLC + * sibling on each online event corrects this once the full + * shared_cpu_map is known. + */ + for_each_cpu(i, sched_domain_span(sdp)) { + sd =3D rcu_dereference_sched_domain(cpu_rq(i)->sd); + if (sd) + sd->llc_bytes =3D get_effective_llc_bytes(i, sdp); + } + +unlock: + sched_domains_mutex_unlock(); +} #else static bool alloc_sd_llc(const struct cpumask *cpu_map, struct s_data *d) --=20 2.32.0 From nobody Fri Jun 12 15:46:56 2026 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0B69525B0B6 for ; Wed, 13 May 2026 20:33:38 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.17 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704419; cv=none; b=Ai7CDkiu4GByoe/eOnM2gK6K9pMW2L9s1dGMU3oJT/gIZCPznTUl1VMF0Wt5OfS3OlHscRau8zHWhV41ZshWVWg7v9z45dMVfLjdaw6sEyA4hSciluvb1kZjnmO4+uRDSxFyOgOmFaiSbqRjfd3QPBjovI4BVplg1U2qA5GULJo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704419; c=relaxed/simple; bh=qCO/hC7dazYTOOBM6v2DtzEII3MhMEZahEUswvVOKpA=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: 
MIME-Version; b=AP68ElUpR0soGaF5YOs+aqkLGzf6upWE2hZ5sRqOD+qo6XlHsTxZ82NCPoA/SCFYa4Jnbf3N1ZwgzTrIEkJD6CagF8/SBAY9IEyeQfeZdnJZeahBkSiEz0rQJpl8ZnLyU9oCtrnVdFzQUu2uNXD59r9zsYs111ubi7mLNUhXBTI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=c2k/pqnQ; arc=none smtp.client-ip=198.175.65.17 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="c2k/pqnQ" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1778704418; x=1810240418; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=qCO/hC7dazYTOOBM6v2DtzEII3MhMEZahEUswvVOKpA=; b=c2k/pqnQI05yR7hZDBRDdJeJS4b4TTKW9g0R4rsl4i+0peeUaHMsmMWP 9ofdLrZHaHNd/Y/aOnT+ijUzSwmB5hCvoLGZq12hf0igaX3Z2uItDtC+x nwskJjGRxdiMVOb9B6BE9ZySGG9oJ5lJAcM0+oQL+ypXi5ABcYhv8psU/ DKhd2JH6Ioahom53b78rK/i36u6oBJYUOuhHKLjtTQHZ7Bd6yTTX6VA94 ILmLi4fAxJG+HCyhBisvkINRT+e9QXNSN8UcVAp5lLkf+aIF6w+f/1+w5 fJ3lokz4ahUAnAkuX9Ujzflxj+XwaH0R6X12CyukRfH8fijgHyp8nRYso w==; X-CSE-ConnectionGUID: OY51K5bIT/CpUZJryxmUBg== X-CSE-MsgGUID: DKl7l1SpR86RrbvJoUaeJg== X-IronPort-AV: E=McAfee;i="6800,10657,11785"; a="79623061" X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="79623061" Received: from orviesa008.jf.intel.com ([10.64.159.148]) by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 May 2026 13:33:37 -0700 X-CSE-ConnectionGUID: 8HXddkvBQ5KQdjdz7oOSag== X-CSE-MsgGUID: 1oPT4vC3QnOhZiQHy/K1BA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="238076344" Received: from 
b04f130c83f2.jf.intel.com ([10.165.154.98]) by orviesa008.jf.intel.com with ESMTP; 13 May 2026 13:33:37 -0700 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , Luo Gengkun , linux-kernel@vger.kernel.org Subject: [Patch v4 05/16] sched/cache: Avoid cache-aware scheduling for memory-heavy processes Date: Wed, 13 May 2026 13:39:16 -0700 Message-Id: <95cf64a385bcc12f18dcebe9d59e8d3ba8bb318f.1778703694.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

From: Chen Yu

Prateek and Tingyin reported that memory-intensive workloads (such as
stream) can saturate memory bandwidth and caches on the preferred LLC
when sched_cache aggregates too many threads.

To mitigate this, estimate a process's memory footprint from its NUMA
balancing fault statistics and compare it to the size of the LLC. If
the footprint exceeds the LLC size, skip cache-aware scheduling.

Note that this footprint is only an approximation of the real working
set, since the kernel lacks suitable metrics to measure it. A
user-provided hint would be more accurate, if one becomes available in
the future. A later patch will allow users to provide a hint to adjust
this threshold.

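The check described above can be modelled in user space as follows. This is only a sketch under stated assumptions, not the kernel implementation: sketch_exceeds_llc(), sketch_update_footprint() and SKETCH_PAGE_SIZE are hypothetical names, the footprint is assumed to be counted in pages (as NUMA hinting faults are), and a 4 KiB page is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for illustration: 4 KiB pages; the kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096ULL

/*
 * Hypothetical model of the capacity check: footprint_pages stands in
 * for mm->sc_stat.footprint and llc_bytes for sd->llc_bytes. The process
 * is considered too large when its footprint in bytes is strictly
 * greater than the usable LLC size.
 */
static bool sketch_exceeds_llc(uint64_t llc_bytes, uint64_t footprint_pages)
{
	return llc_bytes < footprint_pages * SKETCH_PAGE_SIZE;
}

/*
 * Hypothetical model of the clamped, lockless footprint update: diff may
 * be negative when faults decay, and racy updates may push the sum below
 * zero, so the result is clamped at zero instead of wrapping.
 */
static uint64_t sketch_update_footprint(uint64_t footprint, int64_t diff)
{
	int64_t new_fp;

	/* Precondition of this sketch: the value fits in a signed 64-bit. */
	assert(footprint <= (uint64_t)INT64_MAX);

	new_fp = (int64_t)footprint + diff;
	return new_fp > 0 ? (uint64_t)new_fp : 0;
}
```

With a 32MB LLC, a footprint of 8192 pages (exactly 32MB) still fits, while 8193 pages does not. The clamp is the design choice worth noting: it trades occasional inaccuracy for the absence of locking on the accounting path.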
Tested-by: Tingyin Duan Suggested-by: K Prateek Nayak Suggested-by: Vern Hao Signed-off-by: Chen Yu Co-developed-by: Tim Chen Signed-off-by: Tim Chen --- include/linux/sched.h | 1 + kernel/exit.c | 29 ++++++++++++++++++++ kernel/sched/fair.c | 62 ++++++++++++++++++++++++++++++++++++++++--- 3 files changed, 89 insertions(+), 3 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 6701911eaaf7..95729670929c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2425,6 +2425,7 @@ struct sched_cache_stat { unsigned long epoch; u64 nr_running_avg; unsigned long next_scan; + unsigned long footprint; int cpu; } ____cacheline_aligned_in_smp; =20 diff --git a/kernel/exit.c b/kernel/exit.c index ede3117fa7d4..77275c26a2a1 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -543,6 +543,32 @@ void mm_update_next_owner(struct mm_struct *mm) } #endif /* CONFIG_MEMCG */ =20 +#if defined(CONFIG_SCHED_CACHE) && defined(CONFIG_NUMA_BALANCING) +/* + * Subtract the memory footprint of the current task from + * mm. + */ +static void exit_mm_sched_cache(struct mm_struct *mm) +{ + unsigned long fp, sub; + + if (!current->total_numa_faults) + return; + /* + * No lock protection due to performance considerations. + * Make sure mm->sc_stat.footprint does not become + * negative. + */ + fp =3D READ_ONCE(mm->sc_stat.footprint); + sub =3D min(fp, current->total_numa_faults); + WRITE_ONCE(mm->sc_stat.footprint, fp - sub); +} +#else +static inline void exit_mm_sched_cache(struct mm_struct *mm) +{ +} +#endif /* CONFIG_SCHED_CACHE CONFIG_NUMA_BALANCING */ + /* * Turn us into a lazy TLB process if we * aren't already.. 
@@ -554,6 +580,9 @@ static void exit_mm(void) exit_mm_release(current, mm); if (!mm) return; + + exit_mm_sched_cache(mm); + mmap_read_lock(mm); mmgrab_lazy_tlb(mm); BUG_ON(mm !=3D current->active_mm); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index df21366ba1ca..a10116ffe0d1 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1384,6 +1384,32 @@ static int llc_id(int cpu) return per_cpu(sd_llc_id, cpu); } =20 +static bool exceed_llc_capacity(struct mm_struct *mm, int cpu) +{ +#ifdef CONFIG_NUMA_BALANCING + unsigned long llc, footprint; + struct sched_domain *sd; + + guard(rcu)(); + + sd =3D rcu_dereference_sched_domain(cpu_rq(cpu)->sd); + if (!sd) + return true; + + if (static_branch_likely(&sched_numa_balancing)) { + /* + * TBD: RDT exclusive LLC ways reserved should be + * excluded. + */ + llc =3D sd->llc_bytes; + footprint =3D READ_ONCE(mm->sc_stat.footprint); + + return (llc < (footprint * PAGE_SIZE)); + } +#endif + return false; +} + static bool invalid_llc_nr(struct mm_struct *mm, struct task_struct *p, int cpu) { @@ -1463,6 +1489,7 @@ void mm_init_sched(struct mm_struct *mm, mm->sc_stat.cpu =3D -1; mm->sc_stat.next_scan =3D jiffies; mm->sc_stat.nr_running_avg =3D 0; + mm->sc_stat.footprint =3D 0; /* * The update to mm->sc_stat should not be reordered * before initialization to mm's other fields, in case @@ -1585,7 +1612,8 @@ void account_mm_sched(struct rq *rq, struct task_stru= ct *p, s64 delta_exec) * its preferred state. 
*/ if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT || - invalid_llc_nr(mm, p, cpu_of(rq))) { + invalid_llc_nr(mm, p, cpu_of(rq)) || + exceed_llc_capacity(mm, cpu_of(rq))) { if (mm->sc_stat.cpu !=3D -1) mm->sc_stat.cpu =3D -1; } @@ -1716,7 +1744,8 @@ static void task_cache_work(struct callback_head *wor= k) return; =20 curr_cpu =3D task_cpu(p); - if (invalid_llc_nr(mm, p, curr_cpu)) { + if (invalid_llc_nr(mm, p, curr_cpu) || + exceed_llc_capacity(mm, curr_cpu)) { if (mm->sc_stat.cpu !=3D -1) mm->sc_stat.cpu =3D -1; =20 @@ -3515,6 +3544,7 @@ static void task_numa_placement(struct task_struct *p) unsigned long total_faults; u64 runtime, period; spinlock_t *group_lock =3D NULL; + long __maybe_unused new_fp; struct numa_group *ng; =20 /* @@ -3589,6 +3619,31 @@ static void task_numa_placement(struct task_struct *= p) ng->total_faults +=3D diff; group_faults +=3D ng->faults[mem_idx]; } +#ifdef CONFIG_SCHED_CACHE + /* + * Per task p->numa_faults[mem_idx] converges, + * so the accumulation of each task's faults + * converges too - Given the number of threads, + * it cannot overflow an unsigned long. + * Racy with concurrent updates from other threads + * sharing this mm. Acceptable since footprint is a + * heuristic and occasional lost updates are tolerable. + * + * If a task exits, its corresponding footprint must + * be subtracted from the mm->sc_stat.footprint, otherwise + * the mm->sc_stat.footprint will not converge: + * the exiting thread's footprint remains unchanged/undecayed + * in mm->sc_stat.footprint. See exit_mm(). + * + * Lost updates and unsynchronized subtraction + * in exit_mm() can cause footprint + diff to + * go negative. Clamp to zero to prevent the + * unsigned footprint from wrapping. 
+ */ + new_fp =3D (long)READ_ONCE(p->mm->sc_stat.footprint) + diff; + WRITE_ONCE(p->mm->sc_stat.footprint, + max(new_fp, 0L)); +#endif } =20 if (!ng) { @@ -10338,7 +10393,8 @@ static enum llc_mig can_migrate_llc_task(int src_cp= u, int dst_cpu, return mig_unrestricted; =20 /* skip cache aware load balance for too many threads */ - if (invalid_llc_nr(mm, p, dst_cpu)) { + if (invalid_llc_nr(mm, p, dst_cpu) || + exceed_llc_capacity(mm, dst_cpu)) { if (mm->sc_stat.cpu !=3D -1) mm->sc_stat.cpu =3D -1; return mig_unrestricted; --=20 2.32.0 From nobody Fri Jun 12 15:46:56 2026 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E8B81382F1A for ; Wed, 13 May 2026 20:33:38 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.17 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704420; cv=none; b=BfnNM3dxg2q8u+3/3UzvObC/fpsC64lYeXnNrA61Ti/5qHL1aojgppc48XDHy2i87UqsEstPO4NO7NgCdOF47cSitX/R/vdLOk/ZSVjcyMv5i50RLy1yeeb5rmkO0K2HdootLTSe/6RlVp3pxyzD2cBHkus71+0gberAmwHRBj4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704420; c=relaxed/simple; bh=My8qPX4ZhhSb/OISYHVK8VtGkA3KRPphT3KhlOl4m1w=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=k2ToAVf7JagLN5Hc/DzTpEuhsi+J5/8twfijGm9Ha0Cg8cWUDQbfi/S2QPYLSac9bka2f2RfRCYjN3VfIVKW5HgqqFLcIzMB5yA0aiSkBBHt5hSxKMD6WY1hLggvJLPxnq486/3bKUU/J97fcapHMGyBMU6B1nK+tGn28SwHkGA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=HpcNdGND; arc=none smtp.client-ip=198.175.65.17 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) 
header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="HpcNdGND" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1778704419; x=1810240419; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=My8qPX4ZhhSb/OISYHVK8VtGkA3KRPphT3KhlOl4m1w=; b=HpcNdGNDW4pUV++fypwQN7/IDIQia2bE4BllSI1TE/1AC7VWDpJfo9Pw wE1jlWy2TjwABVfgaQ8zQOXUqpSoPZwdQsujbRIK0lYKyCLIyEI92oPO8 dGm0Cst4gOPXprFgDeBF0EPO1eIIzxq68QKmimKGvbd5rlRmJOdy9rm9g LH7SGWY1rqnae/3tH69iFv91YnLAXHEOuFzwFrTTIf9NOvuSfyXaUu1tU JzgwuvzLOYonJ3KHRfpDy7aAuqlFi68+i/FwWNrrm8//POqHdG8j2lbuK 052lMk8OQjcvb0zr3XEM3CTctMZBOjMxUV0Hx0SMPlICH0KL9N7AjqLK1 g==; X-CSE-ConnectionGUID: TZMwbnxJRRi7XTgKohedTw== X-CSE-MsgGUID: IaEnJwDKS7yx+rXABrMo0A== X-IronPort-AV: E=McAfee;i="6800,10657,11785"; a="79623082" X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="79623082" Received: from orviesa008.jf.intel.com ([10.64.159.148]) by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 May 2026 13:33:38 -0700 X-CSE-ConnectionGUID: rrDAu57tQrGYMBagZkOHPg== X-CSE-MsgGUID: kMYKS0VHT+K8HUuq30u4PA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="238076352" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by orviesa008.jf.intel.com with ESMTP; 13 May 2026 13:33:38 -0700 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef 
, Libo Chen , Luo Gengkun , linux-kernel@vger.kernel.org Subject: [Patch v4 06/16] sched/cache: Add user control to adjust the aggressiveness of cache-aware scheduling Date: Wed, 13 May 2026 13:39:17 -0700 Message-Id: <1c62cc060ba2b33d7b1f0ed98b3390128edbae93.1778703694.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

From: Chen Yu

Introduce a set of debugfs knobs to control how aggressively cache-aware
scheduling aggregates tasks.

(1) aggr_tolerance

With sched_cache enabled, the scheduler uses a process's memory
footprint as a proxy for its LLC footprint to determine whether
aggregating tasks on the preferred LLC could cause cache contention. If
the footprint exceeds the LLC size, aggregation is skipped. Since the
kernel cannot efficiently track per-task cache usage (resctrl is
user-space only), userspace can provide a more accurate hint.

Introduce /sys/kernel/debug/sched/llc_balancing/aggr_tolerance to let
users control how strictly footprint limits aggregation. Values range
from 0 to 100:

  - 0: Cache-aware scheduling is disabled.
  - 1: Strict; tasks with footprint larger than LLC size are skipped.
  - >=3D100: Aggressive; tasks are aggregated regardless of footprint.

For example, with a 32MB L3 cache:

  - aggr_tolerance=3D1 -> tasks with footprint > 32MB are skipped.
  - aggr_tolerance=3D99 -> tasks with footprint > 784GB are skipped
    (784GB =3D (1 + (99 - 1) * 256) * 32MB).

Similarly, /sys/kernel/debug/sched/llc_balancing/aggr_tolerance also
controls how strictly the number of active threads is considered when
doing cache-aware load balancing. The number of SMT threads per core is
also taken into account: high SMT counts reduce the aggregation
capacity, preventing excessive task aggregation on SMT-heavy systems
like Power10/Power11.

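The tolerance arithmetic above can be checked with a small user-space model. This is a sketch, not the kernel code: sketch_cache_scale() and sketch_exceeds_scaled_llc() are hypothetical names that mirror the scaling rule described in this message, and the footprint is passed in bytes for simplicity.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the scale factor derived from aggr_tolerance:
 *   0      -> 0        (nothing fits, aggregation effectively disabled)
 *   1..99  -> 1 + (tol - 1) * mul
 *   >= 100 -> INT_MAX  (sentinel: ignore the limit entirely)
 */
static int sketch_cache_scale(unsigned int tol, int mul)
{
	if (!tol)
		return 0;

	if (tol >= 100)
		return INT_MAX;

	return 1 + (int)(tol - 1) * mul;
}

/*
 * Hypothetical footprint check using a multiplier of 256, as in the
 * 32MB example: returns true when the footprint is larger than the
 * scaled LLC size and aggregation should be skipped.
 */
static bool sketch_exceeds_scaled_llc(uint64_t llc_bytes,
				      uint64_t footprint_bytes,
				      unsigned int tol)
{
	int scale = sketch_cache_scale(tol, 256);

	if (scale == INT_MAX)
		return false;

	return llc_bytes * (uint64_t)scale < footprint_bytes;
}
```

For a tolerance of 99 the scale is 1 + 98 * 256, that is 25089, and 25089 * 32MB is slightly above 784GB, which matches the figure quoted above. Widening to 64 bits before multiplying keeps the scaled LLC size from overflowing on 32-bit builds.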
Yangyu suggested introducing separate aggregation controls for the number of active threads and memory footprint checks. Since there are plans to add per-process/task group controls, fine-grained tunables are deferred to that implementation. (2) epoch_period, epoch_affinity_timeout, imb_pct, overaggr_pct are also turned into tunables. Tested-by: Tingyin Duan Suggested-by: K Prateek Nayak Suggested-by: Madadi Vineeth Reddy Suggested-by: Shrikanth Hegde Suggested-by: Tingyin Duan Suggested-by: Jianyong Wu Suggested-by: Yangyu Chen Signed-off-by: Chen Yu Co-developed-by: Tim Chen Signed-off-by: Tim Chen --- kernel/sched/debug.c | 10 +++++++ kernel/sched/fair.c | 68 ++++++++++++++++++++++++++++++++++++++------ kernel/sched/sched.h | 5 ++++ 3 files changed, 75 insertions(+), 8 deletions(-) diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index 2eae67cd2ba2..fe569539e888 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -670,6 +670,16 @@ static __init int sched_init_debug(void) llc =3D debugfs_create_dir("llc_balancing", debugfs_sched); debugfs_create_file("enabled", 0644, llc, NULL, &sched_cache_enable_fops); + debugfs_create_u32("aggr_tolerance", 0644, llc, + &llc_aggr_tolerance); + debugfs_create_u32("epoch_period", 0644, llc, + &llc_epoch_period); + debugfs_create_u32("epoch_affinity_timeout", 0644, llc, + &llc_epoch_affinity_timeout); + debugfs_create_u32("overaggr_pct", 0644, llc, + &llc_overaggr_pct); + debugfs_create_u32("imb_pct", 0644, llc, + &llc_imb_pct); #endif =20 debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops= ); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index a10116ffe0d1..01ce646792ff 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1375,6 +1375,11 @@ static void set_next_buddy(struct sched_entity *se); */ #define EPOCH_PERIOD (HZ / 100) /* 10 ms */ #define EPOCH_LLC_AFFINITY_TIMEOUT 5 /* 50 ms */ +__read_mostly unsigned int llc_aggr_tolerance =3D 1; +__read_mostly unsigned int 
llc_epoch_period =3D EPOCH_PERIOD; +__read_mostly unsigned int llc_epoch_affinity_timeout =3D EPOCH_LLC_AFFINI= TY_TIMEOUT; +__read_mostly unsigned int llc_imb_pct =3D 20; +__read_mostly unsigned int llc_overaggr_pct =3D 50; =20 static int llc_id(int cpu) { @@ -1384,11 +1389,25 @@ static int llc_id(int cpu) return per_cpu(sd_llc_id, cpu); } =20 +static inline int get_sched_cache_scale(int mul) +{ + unsigned int tol =3D READ_ONCE(llc_aggr_tolerance); + + if (!tol) + return 0; + + if (tol >=3D 100) + return INT_MAX; + + return (1 + (tol - 1) * mul); +} + static bool exceed_llc_capacity(struct mm_struct *mm, int cpu) { #ifdef CONFIG_NUMA_BALANCING unsigned long llc, footprint; struct sched_domain *sd; + int scale; =20 guard(rcu)(); =20 @@ -1404,7 +1423,28 @@ static bool exceed_llc_capacity(struct mm_struct *mm= , int cpu) llc =3D sd->llc_bytes; footprint =3D READ_ONCE(mm->sc_stat.footprint); =20 - return (llc < (footprint * PAGE_SIZE)); + /* + * Scale the LLC size by 256*llc_aggr_tolerance + * and compare it to the task's footprint. + * + * Suppose the L3 size is 32MB. If the + * llc_aggr_tolerance is 1: + * When the footprint is larger than 32MB, the + * process is regarded as exceeding the LLC + * capacity. If the llc_aggr_tolerance is 99: + * When the footprint is larger than 784GB, the + * process is regarded as exceeding the LLC + * capacity: + * 784GB =3D (1 + (99 - 1) * 256) * 32MB + * If the llc_aggr_tolerance is 100: + * ignore the footprint and do the aggregation + * anyway. 
+ */ + scale =3D get_sched_cache_scale(256); + if (scale =3D=3D INT_MAX) + return false; + + return ((llc * (u64)scale) < (footprint * PAGE_SIZE)); } #endif return false; @@ -1413,11 +1453,21 @@ static bool exceed_llc_capacity(struct mm_struct *m= m, int cpu) static bool invalid_llc_nr(struct mm_struct *mm, struct task_struct *p, int cpu) { + int scale; + if (get_nr_threads(p) <=3D 1) return true; =20 + /* + * Scale the number of 'cores' in a LLC by llc_aggr_tolerance + * and compare it to the task's active threads. + */ + scale =3D get_sched_cache_scale(1); + if (scale =3D=3D INT_MAX) + return false; + return !fits_capacity((mm->sc_stat.nr_running_avg * cpu_smt_num_threads), - per_cpu(sd_llc_size, cpu)); + (scale * per_cpu(sd_llc_size, cpu))); } =20 static void account_llc_enqueue(struct rq *rq, struct task_struct *p) @@ -1513,13 +1563,14 @@ static inline void __update_mm_sched(struct rq *rq, { lockdep_assert_held(&rq->cpu_epoch_lock); =20 + unsigned int period =3D max(READ_ONCE(llc_epoch_period), 1U); unsigned long n, now =3D jiffies; long delta =3D now - rq->cpu_epoch_next; =20 if (delta > 0) { - n =3D (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD; + n =3D (delta + period - 1) / period; rq->cpu_epoch +=3D n; - rq->cpu_epoch_next +=3D n * EPOCH_PERIOD; + rq->cpu_epoch_next +=3D n * period; __shr_u64(&rq->cpu_runtime, n); } =20 @@ -1611,7 +1662,7 @@ void account_mm_sched(struct rq *rq, struct task_stru= ct *p, s64 delta_exec) * If this process hasn't hit task_cache_work() for a while invalidate * its preferred state. 
*/ - if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT || + if (epoch - READ_ONCE(mm->sc_stat.epoch) > llc_epoch_affinity_timeout || invalid_llc_nr(mm, p, cpu_of(rq)) || exceed_llc_capacity(mm, cpu_of(rq))) { if (mm->sc_stat.cpu !=3D -1) @@ -1740,7 +1791,8 @@ static void task_cache_work(struct callback_head *wor= k) =20 /* only 1 thread is allowed to scan */ if (!try_cmpxchg(&mm->sc_stat.next_scan, &next_scan, - now + EPOCH_PERIOD)) + now + max_t(unsigned long, + READ_ONCE(llc_epoch_period), 1))) return; =20 curr_cpu =3D task_cpu(p); @@ -10232,7 +10284,7 @@ static inline int task_is_ineligible_on_dst_cpu(str= uct task_struct *p, int dest_ */ static bool fits_llc_capacity(unsigned long util, unsigned long max) { - u32 aggr_pct =3D 50; + u32 aggr_pct =3D llc_overaggr_pct; =20 /* * For single core systems, raise the aggregation @@ -10252,7 +10304,7 @@ static bool fits_llc_capacity(unsigned long util, u= nsigned long max) */ /* Allows dst util to be bigger than src util by up to bias percent */ #define util_greater(util1, util2) \ - ((util1) * 100 > (util2) * 120) + ((util1) * 100 > (util2) * (100 + llc_imb_pct)) =20 static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util, unsigned long *cap) diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index f499d5dd1130..27409399137c 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -4072,6 +4072,11 @@ static inline void mm_cid_switch_to(struct task_stru= ct *prev, struct task_struct DECLARE_STATIC_KEY_FALSE(sched_cache_present); DECLARE_STATIC_KEY_FALSE(sched_cache_active); extern int sysctl_sched_cache_user; +extern unsigned int llc_aggr_tolerance; +extern unsigned int llc_epoch_period; +extern unsigned int llc_epoch_affinity_timeout; +extern unsigned int llc_imb_pct; +extern unsigned int llc_overaggr_pct; =20 static inline bool sched_cache_enabled(void) { --=20 2.32.0 From nobody Fri Jun 12 15:46:56 2026 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17]) 
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F13AA1C84BB for ; Wed, 13 May 2026 20:33:39 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.17 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704425; cv=none; b=R2svB5wIM888zLWp05qrO8PLxcGhve11o/utv44aFztYvmftvh0yd10k54qGXPSl7Kse3Z/zgkr/1dFkiVpP5jRs7FaK+KSma1uSytD64imEmFGMc6yG65Y54vYHmiaLNndvLEVI4OHIrMZc2V5Wl+Y/mQ+/CfulOKY81CDzjB0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704425; c=relaxed/simple; bh=QpfAZD5+RHmNHPnMXUPpxcOBAUzJXhzR1ngj1b6Stms=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=SLfiMX4mSqjnY+kIQ1isiUVDkkUSGUcTGA6ruOKrJP3tBF3xzKUB0KMbWmuGYN9AQ40wrz85WQipm8ybpXwoAN44DRrqOgbU1Y0CY03ooVi2CZjNsQgQjSJ6iXyE/xdoW0gXAfjXIshjlEeNe90Fm2E43Uhrp6wYlSw7nPJ/FNo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=Qm7e2MxG; arc=none smtp.client-ip=198.175.65.17 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="Qm7e2MxG" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1778704420; x=1810240420; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=QpfAZD5+RHmNHPnMXUPpxcOBAUzJXhzR1ngj1b6Stms=; b=Qm7e2MxGlEV2VUJw7tBbmUeyHlHNdFXEyjH3t9q1MMh4UcGolmVqGBvJ pmB1fPzkidyDBJOxlPvgWo/nlLlQzFUpU6HCF66fmvn6+S6kS27oADkwg 
CsHj4IOBKYELpwTi1EOOJSQr/83/0XbWRdhLRHw1F8KvF8M2JBo/p+yYf szTqLvlpGhImBIyrNuAFL7qr2kC7gP3Jx3EPTUP77RTYxp6/fVbgL4fGN V05IN3JdYQwMaAI1fGlE6KP+WE0mDD7V//Em3uhnfEUan9C4zs2M41FEP UUtD7ozLhJuyKK4vkzbBSMR6H0LabHbdr3iJlPcGu+p9KVZu9Rs7K9Icn w==; X-CSE-ConnectionGUID: fIRuwq/aTaKgkAQmE+4Ntg== X-CSE-MsgGUID: 0O0sWxMqRhGWBPJKfVaWaw== X-IronPort-AV: E=McAfee;i="6800,10657,11785"; a="79623104" X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="79623104" Received: from orviesa008.jf.intel.com ([10.64.159.148]) by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 May 2026 13:33:40 -0700 X-CSE-ConnectionGUID: BHD3mRTgSFuyz3N4yveAJA== X-CSE-MsgGUID: e708Cl5vSlmRux+CgJue2g== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="238076359" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by orviesa008.jf.intel.com with ESMTP; 13 May 2026 13:33:39 -0700 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , Luo Gengkun , linux-kernel@vger.kernel.org Subject: [Patch v4 07/16] sched/cache: Fix rcu warning when accessing sd_llc domain Date: Wed, 13 May 2026 13:39:18 -0700 Message-Id: <2dc49455e861215d8059a1c877953f0b95990038.1778703694.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu rcu_dereference_all() should be used to access the sd_llc domain under RCU protection. 
This bug was reported by sashiko. Fixes: df0d98475954 ("sched/cache: Introduce infrastructure for cache-aware= load balancing") Signed-off-by: Chen Yu Co-developed-by: Tim Chen Signed-off-by: Tim Chen --- kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 01ce646792ff..be96d80c9310 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1814,7 +1814,7 @@ static void task_cache_work(struct callback_head *wor= k) =20 for_each_cpu(cpu, cpus) { /* XXX sched_cluster_active */ - struct sched_domain *sd =3D per_cpu(sd_llc, cpu); + struct sched_domain *sd =3D rcu_dereference_all(per_cpu(sd_llc, cpu)); unsigned long occ, m_occ =3D 0, a_occ =3D 0; int m_cpu =3D -1, i; =20 --=20 2.32.0 From nobody Fri Jun 12 15:46:56 2026 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 66F36379C33 for ; Wed, 13 May 2026 20:33:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.17 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704430; cv=none; b=hKnbMvU4rY/UERAim6mTLXlmRjOINyX30iZuJmicTVefZmZP70ljSQDSV/01EfzwfJvtgZa96lq0vNpzwV8sCfhjPeRB0zuzr2LHrWx+vF4xQol0woyr576E0c8gdgAT5Oc5FuGgqJZZLb0PhYvX7uiJx6Dsi8YZIxqaJ/2R7JI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704430; c=relaxed/simple; bh=X7rEYXsps9lolKowVjX6yfvNXC4HUV/AW1S5DZm0CWw=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=aecz0eyBM/8uH/qLTYIvTQvdV9rxjxrOPTO94cHntb5UWDWPcTAfx+2SdyveP0yX59gWHrQIK5tz+Paq8MVgCzsBtofCSVAIYCRZMlwloovbAVE26fNlFpxk75CCH3Q81g5B26kmmXaGb++LW0solWfcxS+D8HJoKY9OFOME/ro= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass 
smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=HaVik9vd; arc=none smtp.client-ip=198.175.65.17 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="HaVik9vd" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1778704421; x=1810240421; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=X7rEYXsps9lolKowVjX6yfvNXC4HUV/AW1S5DZm0CWw=; b=HaVik9vd5050uzo8wv7tNjMM3riyOw6umXw4iJ5JntEtfmPfPEjtBYDS Br19V4wjBmDP/QENT/8hfHLFaoKwXlH0i734tF8O/2rY47vBvucV7jFNx n9EfRUqxBjE8/NtRQsAnpawPfip4yEFQRpYH4dnmM64d/Cy9X9zAXVmF+ 1I83oCMKeW4/YKpOCIZ5ZgSViEEDmVpj3QdqLBbUhbHgDcV6SFjrSyUyT F+eT71hTuWSANINmNJ2QjBX9sfsQ+c/v6/K//Ew7/k42t8vB2doh1Yy8c UAVn/PFsy5TwUcVIilQjnjGe4UaRyNBRA3HS2bk6/4Pfz6eXSU9LYesat g==; X-CSE-ConnectionGUID: jjElaELlQlqt0Pw5HhF/1Q== X-CSE-MsgGUID: mQrHFYrgRAaLN7SGht0JNQ== X-IronPort-AV: E=McAfee;i="6800,10657,11785"; a="79623127" X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="79623127" Received: from orviesa008.jf.intel.com ([10.64.159.148]) by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 May 2026 13:33:41 -0700 X-CSE-ConnectionGUID: bU0/dPyARu6YYe+GqnizOw== X-CSE-MsgGUID: gl0WTuQLQouL4gu8cgT/fg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="238076374" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by orviesa008.jf.intel.com with ESMTP; 13 May 2026 13:33:40 -0700 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi 
Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , Luo Gengkun , linux-kernel@vger.kernel.org Subject: [Patch v4 08/16] sched/cache: Fix potential NULL mm pointer access Date: Wed, 13 May 2026 13:39:19 -0700 Message-Id: <066d8cfa45d4822bf4367e788c50377c66bbcc82.1778703694.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

From: Chen Yu

A concurrent task exit might cause a NULL pointer dereference in
account_mm_sched(), because p->mm is read again after the NULL check.
Use the locally cached mm pointer instead, since the active_mm
reference guarantees the structure remains allocated.

Also skip kernel threads in task_tick_cache(), since they are
irrelevant to cache-aware scheduling.

This bug was reported by sashiko and Vern.

Fixes: df0d98475954 ("sched/cache: Introduce infrastructure for cache-aware= load balancing") Reported-by: Vern Hao Signed-off-by: Chen Yu Co-developed-by: Tim Chen Signed-off-by: Tim Chen Link: https://lore.kernel.org/all/09cf7ee3-6e27-4505-9692-4b4a4707c8b2@gmai= l.com/ --- kernel/sched/fair.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index be96d80c9310..913b09254732 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1649,7 +1649,7 @@ void account_mm_sched(struct rq *rq, struct task_stru= ct *p, s64 delta_exec) if (!mm || !mm->sc_stat.pcpu_sched) return; =20 - pcpu_sched =3D per_cpu_ptr(p->mm->sc_stat.pcpu_sched, cpu_of(rq)); + pcpu_sched =3D per_cpu_ptr(mm->sc_stat.pcpu_sched, cpu_of(rq)); =20 scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) { __update_mm_sched(rq, pcpu_sched); @@ -1689,7 +1689,8 @@ static void task_tick_cache(struct rq *rq, struct tas= k_struct *p) if (!sched_cache_enabled()) return; =20 - if (!mm || !mm->sc_stat.pcpu_sched) + if (!mm || p->flags & PF_KTHREAD || + !mm->sc_stat.pcpu_sched) return; =20 epoch =3D rq->cpu_epoch; --=20 2.32.0 From nobody Fri Jun 12 15:46:56 2026 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id EFE66382F1E for ; Wed, 13 May 2026 20:33:42 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.17 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704431; cv=none; b=g78Ij0kN79U0abRJhSepyo/SGyV1+pr6oqKClpk/+XjCKC0SUwtVXUDU50zeP+PdaOJVYEXxwWpKFPSV5CTQKU+H2AGndXwMtWjJXf0yb6zfNo2qZ6ud8Vk/aNKyQu+a56XqNKpRirllRTZhB8bZcLSm00Wo66bYCfPJ3OPGN7E= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704431; c=relaxed/simple; bh=Hjr3I2K3sS9NXaWhU0v/hUPqGrqUl+Q6KUJXkwPe5g8=; 
h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=sotYRLjYUat5MRn5xgmrCCela/wWUZ0RekMlLnQa4YpCaFv3As46KsC0iJNW3cReyzCMRwWNktil6v7j/A8hzzc2U8Yx8wbhnbe0ZnXLF7zcnMhpSzrY8x320XaDSkTK95/1WuNabXrc5zXkx2EsSdQTcoHSVALyDmpdwroXl7w= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=HYnTiCWY; arc=none smtp.client-ip=198.175.65.17 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="HYnTiCWY" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1778704423; x=1810240423; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Hjr3I2K3sS9NXaWhU0v/hUPqGrqUl+Q6KUJXkwPe5g8=; b=HYnTiCWYOKGaGwRhcG2O4KL+npu3VudnaL7Szvoiw9Ag3kW0Vn4HNjA6 Nl3L6DjnsWU1OAoBX94Od9DYZCnT0a3Hd90934xN1hBrazLLxvEJ7CCRp atnvO6tS3CuQuKNy49GFlpbB7wvfhWFhinbXQ1nccoQllmPa1JSt89JV9 8wtBTAlvMyHc8QhUHERK0ytPC3U34hmR+f3ytQDOfThOZT7ODseX03jHR CGPiuRgqdx3bm6T+6sQ3LkK1DH/r2EYXjiNAjSxiSdK9p8Zcq50G56Nje nx+gmONn2bdtCvyZPN+w+Q/95ieTxVHcxySd7IZD1xKzCh7fkaEhVLwW7 g==; X-CSE-ConnectionGUID: Z4zLvo2LRLmzkVG/qjCgvQ== X-CSE-MsgGUID: YsuJsQbzRkCGlW6T4rc7Ig== X-IronPort-AV: E=McAfee;i="6800,10657,11785"; a="79623151" X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="79623151" Received: from orviesa008.jf.intel.com ([10.64.159.148]) by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 May 2026 13:33:42 -0700 X-CSE-ConnectionGUID: zoYX+OWCQmG7l1zeSc0pBQ== X-CSE-MsgGUID: acUGvIlsRGSUvc/uvvqw8w== X-ExtLoop1: 1 X-IronPort-AV: 
E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="238076384" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by orviesa008.jf.intel.com with ESMTP; 13 May 2026 13:33:41 -0700 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , Luo Gengkun , linux-kernel@vger.kernel.org Subject: [Patch v4 09/16] sched/cache: Annotate lockless accesses to mm->sc_stat.cpu Date: Wed, 13 May 2026 13:39:20 -0700 Message-Id: <63ea494f12efcf265d7134400a06cd75d7f2c310.1778703694.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

From: Chen Yu

mm->sc_stat.cpu is written by task_cache_work() and can be read locklessly by several functions on other CPUs. Use READ_ONCE() for the lockless reads and WRITE_ONCE() for the writes of mm->sc_stat.cpu, so that compiler optimizations cannot produce inconsistent values when the field is accessed more than once.

For example, in get_pref_llc(), if the writer updated the field between two compiler-generated loads, the validation (e.g., cpu !=3D -1) and the subsequent use (e.g., llc_id(cpu)) could operate on different values, allowing a negative CPU ID to be used as an index.

Leave the plain write in mm_init_sched(), where the mm is not yet visible to other CPUs.

This bug was reported by sashiko.
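The snapshot-then-validate idea described above can be sketched in userspace C. The READ_ONCE()/WRITE_ONCE() definitions below are simplified stand-ins built on volatile accesses (they rely on the GCC/Clang __typeof__ extension and are not the kernel's definitions), and struct sc_stat_sketch, llc_of_cpu_sketch and pref_llc_sketch() are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Simplified userspace stand-ins, assuming GCC or Clang. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

#define NR_CPUS_SKETCH 4

struct sc_stat_sketch {
	int cpu;	/* models mm->sc_stat.cpu, -1 means "none" */
};

/* Hypothetical CPU to LLC mapping: CPUs 0-1 in LLC 0, CPUs 2-3 in LLC 1. */
static const int llc_of_cpu_sketch[NR_CPUS_SKETCH] = { 0, 0, 1, 1 };

/*
 * Load the field exactly once, then validate and use the same value.
 * A writer storing -1 in between cannot affect the index used below.
 */
static int pref_llc_sketch(struct sc_stat_sketch *st)
{
	int cpu = READ_ONCE(st->cpu);

	if (cpu < 0 || cpu >= NR_CPUS_SKETCH)
		return -1;

	return llc_of_cpu_sketch[cpu];
}
```

The design point is that the check and the array index both consume the local copy, so they can never disagree, regardless of what a concurrent writer does to the shared field.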
Fixes: 47d8696b95f7 ("sched/cache: Assign preferred LLC ID to processes") Signed-off-by: Chen Yu Co-developed-by: Tim Chen Signed-off-by: Tim Chen --- kernel/sched/fair.c | 29 +++++++++++++++-------------- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 913b09254732..73f185ba6e48 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1598,13 +1598,14 @@ static unsigned long fraction_mm_sched(struct rq *r= q, =20 static int get_pref_llc(struct task_struct *p, struct mm_struct *mm) { - int mm_sched_llc =3D -1; + int mm_sched_llc =3D -1, mm_sched_cpu; =20 if (!mm) return -1; =20 - if (mm->sc_stat.cpu !=3D -1) { - mm_sched_llc =3D llc_id(mm->sc_stat.cpu); + mm_sched_cpu =3D READ_ONCE(mm->sc_stat.cpu); + if (mm_sched_cpu !=3D -1) { + mm_sched_llc =3D llc_id(mm_sched_cpu); =20 #ifdef CONFIG_NUMA_BALANCING /* @@ -1619,7 +1620,7 @@ static int get_pref_llc(struct task_struct *p, struct= mm_struct *mm) */ if (static_branch_likely(&sched_numa_balancing) && p->numa_preferred_nid >=3D 0 && - cpu_to_node(mm->sc_stat.cpu) !=3D p->numa_preferred_nid) + cpu_to_node(mm_sched_cpu) !=3D p->numa_preferred_nid) mm_sched_llc =3D -1; #endif } @@ -1665,8 +1666,8 @@ void account_mm_sched(struct rq *rq, struct task_stru= ct *p, s64 delta_exec) if (epoch - READ_ONCE(mm->sc_stat.epoch) > llc_epoch_affinity_timeout || invalid_llc_nr(mm, p, cpu_of(rq)) || exceed_llc_capacity(mm, cpu_of(rq))) { - if (mm->sc_stat.cpu !=3D -1) - mm->sc_stat.cpu =3D -1; + if (READ_ONCE(mm->sc_stat.cpu) !=3D -1) + WRITE_ONCE(mm->sc_stat.cpu, -1); } =20 mm_sched_llc =3D get_pref_llc(p, mm); @@ -1714,7 +1715,7 @@ static void get_scan_cpumasks(cpumask_var_t cpus, str= uct task_struct *p) if (!static_branch_likely(&sched_numa_balancing)) goto out; =20 - cpu =3D p->mm->sc_stat.cpu; + cpu =3D READ_ONCE(p->mm->sc_stat.cpu); if (cpu !=3D -1) nid =3D cpu_to_node(cpu); curr_cpu =3D task_cpu(p); @@ -1799,8 +1800,8 @@ static void task_cache_work(struct 
callback_head *wor= k) curr_cpu =3D task_cpu(p); if (invalid_llc_nr(mm, p, curr_cpu) || exceed_llc_capacity(mm, curr_cpu)) { - if (mm->sc_stat.cpu !=3D -1) - mm->sc_stat.cpu =3D -1; + if (READ_ONCE(mm->sc_stat.cpu) !=3D -1) + WRITE_ONCE(mm->sc_stat.cpu, -1); =20 return; } @@ -1857,7 +1858,7 @@ static void task_cache_work(struct callback_head *wor= k) m_a_cpu =3D m_cpu; } =20 - if (llc_id(cpu) =3D=3D llc_id(mm->sc_stat.cpu)) + if (llc_id(cpu) =3D=3D llc_id(READ_ONCE(mm->sc_stat.cpu))) curr_m_a_occ =3D a_occ; =20 cpumask_andnot(cpus, cpus, sched_domain_span(sd)); @@ -1875,7 +1876,7 @@ static void task_cache_work(struct callback_head *wor= k) * 3. 2X is chosen based on test results, as it delivers * the optimal performance gain so far. */ - mm->sc_stat.cpu =3D m_a_cpu; + WRITE_ONCE(mm->sc_stat.cpu, m_a_cpu); } =20 update_avg_scale(&mm->sc_stat.nr_running_avg, nr_running); @@ -10441,15 +10442,15 @@ static enum llc_mig can_migrate_llc_task(int src_= cpu, int dst_cpu, if (!mm) return mig_unrestricted; =20 - cpu =3D mm->sc_stat.cpu; + cpu =3D READ_ONCE(mm->sc_stat.cpu); if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu)) return mig_unrestricted; =20 /* skip cache aware load balance for too many threads */ if (invalid_llc_nr(mm, p, dst_cpu) || exceed_llc_capacity(mm, dst_cpu)) { - if (mm->sc_stat.cpu !=3D -1) - mm->sc_stat.cpu =3D -1; + if (READ_ONCE(mm->sc_stat.cpu) !=3D -1) + WRITE_ONCE(mm->sc_stat.cpu, -1); return mig_unrestricted; } =20 --=20 2.32.0 From nobody Fri Jun 12 15:46:56 2026 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 847E8379C3B for ; Wed, 13 May 2026 20:33:45 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.17 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704433; cv=none; 
b=PP0CAeAOB9e6+wAv8gyRLOijdaRQVKQRwqo8k5wbCWIHDk0oifkpopLS8EZIlHZ4Zv4lXierE9jooFoYjLcKU/C3wGPUbLTGasYpjuq/utKG/aV8Iup7BWdttzM6h1hzfOBfV6lpD3cu8Z3J1UhI1eH5ajMon2FxQ4Lhh4NC6xo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704433; c=relaxed/simple; bh=YESSvGLYZOHvqekIBAvKifjum8PdrXsf2pmyO9yVqwk=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=I1nnfZhJXNDEJ2RBbh2gyM+mVkbGykBtNvQyDI93YTrJID6k2b7w0yQigmsnXDWzjJLTAyakOtCNJNj2GSdbCLb6r5ULF8FGv/aTuuweokA/BLSxDwY1sOevovs6UlSmX1zVj/R4xUGoe6GE3nDbuBcw0skbkfONA6xtN0SpES4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=igLz2Ylj; arc=none smtp.client-ip=198.175.65.17 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="igLz2Ylj" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1778704427; x=1810240427; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=YESSvGLYZOHvqekIBAvKifjum8PdrXsf2pmyO9yVqwk=; b=igLz2YljQ5XhHF31ec/jTtjLTyiZUE0ny9Xmhm0fqivVHu6dECIZDgr6 LlI6BPm/Oo8VI6Wwdy15j6EIfyNOMoL6ZJIJ9RjCFd5FCIQPL1EXHaZwk vrf2kM7vBLiiinmBo/oy3e5xdvfvTQoi166icNh00YCVUbbt7LpW33FRH glFnZ7G6Ta0mVvrIBYmtZWVGGokPNdtwkoFD2/Xv35lqbUcpjEl/bNQh9 89jMa04zRjDm7xlKFjQbpX+L78w2Pd+Uz+KFhf+ejmnrqEosKAMJMMQZB 8GcyxTOq5ZbgjrHv90Gww916XZsQDPnO89PrKbsTCj3unN+AoL0aBfbSz Q==; X-CSE-ConnectionGUID: 6ncHlTb+TQOJWfRCMOKKSw== X-CSE-MsgGUID: amQ8MBS+RfSy4wzuJlvY0A== X-IronPort-AV: E=McAfee;i="6800,10657,11785"; a="79623176" X-IronPort-AV: 
E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="79623176" Received: from orviesa008.jf.intel.com ([10.64.159.148]) by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 May 2026 13:33:43 -0700 X-CSE-ConnectionGUID: shjoP4GxTQqdqWA/pM63sg== X-CSE-MsgGUID: 8v4OE/Z4TYaR1iaTMQdBlw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="238076389" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by orviesa008.jf.intel.com with ESMTP; 13 May 2026 13:33:43 -0700 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , Luo Gengkun , linux-kernel@vger.kernel.org Subject: [Patch v4 10/16] sched/cache: Fix unpaired account_llc_enqueue/dequeue Date: Wed, 13 May 2026 13:39:21 -0700 Message-Id: <0c8c6a1571d66792a4d2ff0103ba3cc13e059046.1778703694.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

From: Chen Yu

After a task is enqueued on a runqueue, task_llc(p) may change due to CPU hotplug, because the llc_id is dynamically allocated and adjusted at runtime. Therefore, re-evaluating task_llc(p) at dequeue time to determine whether the task is leaving its preferred LLC is unreliable and can leave nr_pref_llc_running inconsistent.
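The imbalance can be reproduced in a small userspace model. This is a sketch under stated assumptions: struct rq_sketch, struct task_llc_sketch and the naive_*_sketch() helpers are hypothetical, and the cur_llc field stands in for the value returned by task_llc(p), which is changed by hand here to mimic a hotplug event.

```c
#include <assert.h>

struct rq_sketch {
	int nr_pref_llc_running;	/* models rq->nr_pref_llc_running */
};

struct task_llc_sketch {
	int preferred_llc;	/* models p->preferred_llc */
	int cur_llc;		/* models task_llc(p); may change after enqueue */
};

/* Enqueue: count the task if it sits on its preferred LLC right now. */
static void naive_enqueue_sketch(struct rq_sketch *rq,
				 const struct task_llc_sketch *p)
{
	rq->nr_pref_llc_running += (p->preferred_llc == p->cur_llc);
}

/* Dequeue: re-evaluates the comparison instead of reusing the old result. */
static void naive_dequeue_sketch(struct rq_sketch *rq,
				 const struct task_llc_sketch *p)
{
	rq->nr_pref_llc_running -= (p->preferred_llc == p->cur_llc);
}
```

If cur_llc changes between the two calls, the increment is never undone and the counter stays at 1 with no task queued; the opposite change would drive it negative.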
To fix this problem, record whether p is enqueued on its preferred LLC, in order to pair with account_llc_dequeue() to maintain a consistent nr_pref_llc_running per runqueue. This bug was reported by sashiko, and the solution was once suggested by Prateek. Fixes: 46afe3af7ead ("sched/cache: Track LLC-preferred tasks per runqueue") Suggested-by: K Prateek Nayak Signed-off-by: Chen Yu Co-developed-by: Tim Chen Signed-off-by: Tim Chen --- include/linux/sched.h | 2 ++ init/init_task.c | 1 + kernel/sched/fair.c | 31 ++++++++++++++++++++++++++++--- 3 files changed, 31 insertions(+), 3 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 95729670929c..2c9e8e2edde1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1410,6 +1410,8 @@ struct task_struct { #ifdef CONFIG_SCHED_CACHE struct callback_head cache_work; int preferred_llc; + /* 1: task was enqueued to its preferred LLC, 0 otherwise */ + int pref_llc_queued; #endif =20 struct rseq_data rseq; diff --git a/init/init_task.c b/init/init_task.c index 5d90db4ff1f8..3ecd66fbd563 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -217,6 +217,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = =3D { #endif #ifdef CONFIG_SCHED_CACHE .preferred_llc =3D -1, + .pref_llc_queued =3D 0, #endif #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS) .kasan_depth =3D 1, diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 73f185ba6e48..9e6edd40cd80 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1472,15 +1472,32 @@ static bool invalid_llc_nr(struct mm_struct *mm, st= ruct task_struct *p, =20 static void account_llc_enqueue(struct rq *rq, struct task_struct *p) { + int pref_llc, pref_llc_queued; struct sched_domain *sd; - int pref_llc; =20 pref_llc =3D p->preferred_llc; if (pref_llc < 0) return; =20 + pref_llc_queued =3D (pref_llc =3D=3D task_llc(p)); rq->nr_llc_running++; - rq->nr_pref_llc_running +=3D (pref_llc =3D=3D task_llc(p)); + 
rq->nr_pref_llc_running +=3D pref_llc_queued; + + /* + * Record whether p is enqueued on its preferred + * LLC, in order to pair with account_llc_dequeue() + * to maintain a consistent nr_pref_llc_running per + * runqueue. + * This is necessary because a race condition exists: + * after a task is enqueued on a runqueue, task_llc(p) + * may change due to CPU hotplug. Therefore, checking + * task_llc(p) to determine whether the task is being + * dequeued from its preferred LLC is unreliable and + * can cause inconsistent values - checking the + * p->pref_llc_queued in account_llc_dequeue() would + * be reliable. + */ + p->pref_llc_queued =3D pref_llc_queued; =20 sd =3D rcu_dereference_all(rq->sd); if (sd && (unsigned int)pref_llc < sd->llc_max) @@ -1497,7 +1514,15 @@ static void account_llc_dequeue(struct rq *rq, struc= t task_struct *p) return; =20 rq->nr_llc_running--; - rq->nr_pref_llc_running -=3D (pref_llc =3D=3D task_llc(p)); + if (p->pref_llc_queued) { + rq->nr_pref_llc_running--; + /* + * Update the status in case + * other logic might query + * this. 
+ */ + p->pref_llc_queued =3D 0; + } =20 sd =3D rcu_dereference_all(rq->sd); if (sd && (unsigned int)pref_llc < sd->llc_max) { --=20 2.32.0 From nobody Fri Jun 12 15:46:56 2026 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B5FEF368D66 for ; Wed, 13 May 2026 20:33:50 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.17 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704432; cv=none; b=RZdfwgkAMd0J0wtb4p3LDpp/B/KJ8nbHlqaB5wilcHRjN4rnK2LJ5Q9mC9Y25TlIR5JOtmwGHjyl8ibGcX2BsIztVR6atI597XuWQRyI/BLqnkYpKfAcWp+Wprgnf3yny7VoLptVVcAETfSl7udeq/wuG6FYG1k8ey246cP8NE0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704432; c=relaxed/simple; bh=UGo6DN50Uv/tRUerMamkZ/Y7eGfcGZqAsqtg5pBXdUo=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=tWaC3iIZ3bWg52+XSoyq+uR9J7v5btGiCpDW+Cg2eu8mplC71kK081orr4kP35Lgqb8+TG8C0cY+tNZ697GOVTbP4d1Vt+Z57KUb10dir8QSA+lsuSHXI69dVYh7V3c2U+TyuVmEeRcL4mA/bnhGikjWsQY31/4OPeC0rEee+pE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=SXyVODbl; arc=none smtp.client-ip=198.175.65.17 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="SXyVODbl" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1778704431; x=1810240431; 
h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=UGo6DN50Uv/tRUerMamkZ/Y7eGfcGZqAsqtg5pBXdUo=; b=SXyVODblH1seiWkaEWmDeGSwisRQpwjwuounnu/NxGcwUFkptEIiw3EL ERSzljRetjiuM5eVVRFs5qN7S1Wa0bqXcFpicqYbmiyzudDfnSH7pVb63 GmyeJ/lM63WOiXSPDmQFXfE5JCseekmuLAIB5ALFcUeDv8lfwOddm5NBG GhjYct8t9lp7Umx44a9KUH9leNxspYaMJcYw1fG1HYk0yi0d8b4mfIeIa mqkM1noQM8q4N4Q2R43oLNzQ79lIyvdwXePjFdtan3CEezzz7wkoYFG9T tWMakTLyjcsTrBFABg/gbJgg14ba6HkY4+I6Q5RifHWV/JrKyTvKpft2o A==; X-CSE-ConnectionGUID: Y4K4TgKESPSYposRryIQOw== X-CSE-MsgGUID: Iy41SqAWTruo1d1RFJCang== X-IronPort-AV: E=McAfee;i="6800,10657,11785"; a="79623199" X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="79623199" Received: from orviesa008.jf.intel.com ([10.64.159.148]) by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 May 2026 13:33:44 -0700 X-CSE-ConnectionGUID: g2LzkM4+RWySZsJCkY5Tbg== X-CSE-MsgGUID: UIXm4gRrQjWfkBmKQJpgDw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="238076392" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by orviesa008.jf.intel.com with ESMTP; 13 May 2026 13:33:44 -0700 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , Luo Gengkun , linux-kernel@vger.kernel.org Subject: [Patch v4 11/16] sched/cache: Fix checking active load balance by only considering the CFS task Date: Wed, 13 May 2026 13:39:22 -0700 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: 
MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu The currently running task cur may not be a CFS task, such as an RT or Deadline task. For non-CFS tasks, the task_util(cur) utilization average is not maintained, so this might pass a stale or meaningless value to can_migrate_llc(). Check if the task is CFS before getting its task_util(). This bug was reported by sashiko. Fixes: 714059f79ff0 ("sched/cache: Handle moving single tasks to/from their= preferred LLC") Signed-off-by: Chen Yu Co-developed-by: Tim Chen Signed-off-by: Tim Chen --- kernel/sched/fair.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9e6edd40cd80..8617cd3642c7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -10509,7 +10509,8 @@ alb_break_llc(struct lb_env *env) /* * All tasks prefer to stay on their current CPU. * Do not pull a task from its preferred CPU if: - * 1. It is the only task running there(not too imbalance); OR + * 1. It is the only task running and does not exceed + * imbalance allowance; OR * 2. Migrating it away from its preferred LLC would violate * the cache-aware scheduling policy. 
*/ @@ -10522,7 +10523,7 @@ alb_break_llc(struct lb_env *env) return true; =20 cur =3D rcu_dereference_all(env->src_rq->curr); - if (cur) + if (cur && cur->sched_class =3D=3D &fair_sched_class) util =3D task_util(cur); =20 if (can_migrate_llc(env->src_cpu, env->dst_cpu, --=20 2.32.0 From nobody Fri Jun 12 15:46:56 2026 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B14992E0901 for ; Wed, 13 May 2026 20:33:51 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.17 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704433; cv=none; b=Su66mD6x8WCs99MgCcfVMOfZBASSVMTIl4B8JwJbuOEBUB4tTrLH1vOIbeYqzHaX/0NM0l5fqVmQAv+RbDOd9txljQVCoGpQ9GqelFDzz60b0JY83Vz7ueB2vMy6owlx0ylzuRceaCkUDHS8+D1xzG2td00TYphDKViPODL53ZA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704433; c=relaxed/simple; bh=XbuGDa7JJ7FVqD5xSNQStR6gz6ySjrTGgL8YslZ2MOY=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=UZQAKbN8XqLNn1ndZiyTc4LaleobIPyv4bshd2iLlTSeoVh5Rxl6p1wUd61SJqyVlCY2U6mBQUXow0fhFb/dBKafV9mMy0WpmFuJDYs/gKY5qrD6rGx81ocxPJHFBRyF/ngeIm5wHk79FOIb/6lScuxqYHxO34xBo1NF3w700Kw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=oHS+MsNU; arc=none smtp.client-ip=198.175.65.17 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="oHS+MsNU" DKIM-Signature: v=1; 
a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1778704432; x=1810240432; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=XbuGDa7JJ7FVqD5xSNQStR6gz6ySjrTGgL8YslZ2MOY=; b=oHS+MsNUjJ9M+r1YZv72Ts0PFqtf4nvzd/3kp1mA60atqXtFcZAZQuwb zDnoFa2Tsq62xNHX6Sm1HMSrILzMoHZEeY8cWPvwXpOe5Kf75LU5NSfMv T3J5X4I6lnlrQP9/6oF/kDCNQpcitb3Mhd+J+LacRdplDlCZ7Yn9LBiAF pkCJIakapVjp4z/EiMuyaeT7a0Emv1MrWJOBZ5rfyXd1aD658UrJciB3g rzWoUN7NptTDwYFu3QnwC7SeDH2Hr/tlk0Z7zTRObX2pE/JLT6b4eKhDA whLQByHOSdSzDuXNnEKmp8xGeXtEanB/i208bxoBCnkBOgRHKkW1RjuUl Q==; X-CSE-ConnectionGUID: 0OZa9K75SseKqH4UFhsYjA== X-CSE-MsgGUID: XDVtHtarRHmT2f/GVmrVUw== X-IronPort-AV: E=McAfee;i="6800,10657,11785"; a="79623220" X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="79623220" Received: from orviesa008.jf.intel.com ([10.64.159.148]) by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 May 2026 13:33:46 -0700 X-CSE-ConnectionGUID: 2uFbUuxzT8qEkaBirVibpg== X-CSE-MsgGUID: 7+U8vwYWS1qQxb+xWNKJpw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="238076397" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by orviesa008.jf.intel.com with ESMTP; 13 May 2026 13:33:45 -0700 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , Luo Gengkun , linux-kernel@vger.kernel.org Subject: [Patch v4 12/16] sched/cache: Fix race condition during sched domain rebuild Date: Wed, 13 May 2026 13:39:23 -0700 Message-Id: 
<9afddf439687f04bb56b46625bd9f153eb8abad5.1778703694.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu sched_cache_active_set_unlocked() checks hardware support without locks: static void sched_cache_active_set(bool locked) { /* hardware does not support */ if (!static_branch_likely(&sched_cache_present)) { _sched_cache_active_set(false, locked); return; } ... If build_sched_domains() runs concurrently during CPU hotplug, it can disable sched_cache_present under sched_domains_mutex and the CPU hotplug lock. If a debugfs write thread evaluates sched_cache_present as true right before that, and then blocks or gets preempted, it might proceed to enable sched_cache_active after the hardware support has been marked as absent. Make it safer by acquiring cpus_read_lock() and sched_domains_mutex_lock() when the user changes sched_cache_active via debugfs. This bug was reported by sashiko. 
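The check-then-act window described above can be modelled deterministically in userspace C, without real threads or locks. Everything here is a hypothetical sketch: struct cache_state_sketch and the *_sketch() functions are illustrative names, the interleave callback stands in for a concurrent build_sched_domains(), and serialization is modelled simply by never letting the rebuild run in the middle of the update.

```c
#include <assert.h>
#include <stdbool.h>

struct cache_state_sketch {
	bool present;	/* models sched_cache_present */
	bool active;	/* models sched_cache_active */
	bool user;	/* models sysctl_sched_cache_user */
};

/* Models a domain rebuild that finds no hardware support. */
static void rebuild_sketch(struct cache_state_sketch *s)
{
	s->present = false;
	s->active = false;
}

/* Unlocked write path: the rebuild can run between check and update. */
static void unlocked_set_sketch(struct cache_state_sketch *s,
				void (*interleave)(struct cache_state_sketch *))
{
	bool present = s->present;	/* check */

	if (interleave)
		interleave(s);		/* rebuild sneaks in here */

	s->active = present && s->user;	/* update based on a stale check */
}

/*
 * Serialized write path: with both sides holding the same locks, the
 * rebuild runs entirely before or entirely after the check and update.
 */
static void serialized_set_sketch(struct cache_state_sketch *s)
{
	s->active = s->present && s->user;
}
```

In the unlocked model the state can end with active set while present is clear; in the serialized model both orderings end with active clear once the rebuild has marked the hardware support as absent.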
Fixes: 067a31358143 ("sched/cache: Allow the user space to turn on and off = cache aware scheduling") Signed-off-by: Chen Yu Co-developed-by: Tim Chen Signed-off-by: Tim Chen --- kernel/sched/debug.c | 4 +++- kernel/sched/sched.h | 2 +- kernel/sched/topology.c | 42 +++++++++++++++-------------------------- 3 files changed, 19 insertions(+), 29 deletions(-) diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c index fe569539e888..ed3a0d65da0c 100644 --- a/kernel/sched/debug.c +++ b/kernel/sched/debug.c @@ -224,7 +224,9 @@ sched_cache_enable_write(struct file *filp, const char = __user *ubuf, =20 sysctl_sched_cache_user =3D val; =20 - sched_cache_active_set_unlocked(); + sched_cache_active_set(); + + *ppos +=3D cnt; =20 return cnt; } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 27409399137c..45a3b77f46aa 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -4083,7 +4083,7 @@ static inline bool sched_cache_enabled(void) return static_branch_unlikely(&sched_cache_active); } =20 -extern void sched_cache_active_set_unlocked(void); +extern void sched_cache_active_set(void); =20 #endif =20 diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c index 7248a7279abe..cff5a0ecd64d 100644 --- a/kernel/sched/topology.c +++ b/kernel/sched/topology.c @@ -917,30 +917,19 @@ static bool alloc_sd_llc(const struct cpumask *cpu_ma= p, return false; } =20 -static void _sched_cache_active_set(bool enable, bool locked) -{ - if (enable) { - if (locked) - static_branch_enable_cpuslocked(&sched_cache_active); - else - static_branch_enable(&sched_cache_active); - } else { - if (locked) - static_branch_disable_cpuslocked(&sched_cache_active); - else - static_branch_disable(&sched_cache_active); - } -} - /* * Enable/disable cache aware scheduling according to * user input and the presence of hardware support. 
+ * Expected to be protected by cpus_read_lock() and + * sched_domains_mutex_lock() */ -static void sched_cache_active_set(bool locked) +static void _sched_cache_active_set(void) { /* hardware does not support */ if (!static_branch_likely(&sched_cache_present)) { - _sched_cache_active_set(false, locked); + static_branch_disable_cpuslocked(&sched_cache_active); + if (sched_debug()) + pr_info("%s: cache aware scheduling not supported on this platform\n", = __func__); return; } =20 @@ -951,24 +940,23 @@ static void sched_cache_active_set(bool locked) * for now. */ if (sysctl_sched_cache_user) { - _sched_cache_active_set(true, locked); + static_branch_enable_cpuslocked(&sched_cache_active); if (sched_debug()) pr_info("%s: enabling cache aware scheduling\n", __func__); } else { - _sched_cache_active_set(false, locked); + static_branch_disable_cpuslocked(&sched_cache_active); if (sched_debug()) pr_info("%s: disabling cache aware scheduling\n", __func__); } } =20 -static void sched_cache_active_set_locked(void) -{ - return sched_cache_active_set(true); -} - -void sched_cache_active_set_unlocked(void) +void sched_cache_active_set(void) { - return sched_cache_active_set(false); + cpus_read_lock(); + sched_domains_mutex_lock(); + _sched_cache_active_set(); + sched_domains_mutex_unlock(); + cpus_read_unlock(); } =20 /* @@ -3082,7 +3070,7 @@ build_sched_domains(const struct cpumask *cpu_map, st= ruct sched_domain_attr *att else static_branch_disable_cpuslocked(&sched_cache_present); =20 - sched_cache_active_set_locked(); + _sched_cache_active_set(); #endif __free_domain_allocs(&d, alloc_state, cpu_map); =20 --=20 2.32.0 From nobody Fri Jun 12 15:46:56 2026 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 33D23374E71 for ; Wed, 13 May 2026 20:33:52 +0000 (UTC) Authentication-Results: 
smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.17 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704435; cv=none; b=ghgDAI9ryHDy5oqElfCu9DtSHTOi+QJEi40GlGxRyZWcQ3BOauuqGXfcBkpiyPXWrr92BqCrSTMtgU78fWMnjq147YUhAWCsUh9+fN/vWuCQ6yD9KuVdkBihtCrz3ajW6ixoe1d6iVFz5OeBHkh+s7JI2j5ISoPDNgWHTCws7PI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704435; c=relaxed/simple; bh=/VmgYErXZx6IcFJgtwwENbrpk1uKBU0gJ+0pKw7HLsI=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=cMHdsaZBk6xd+lKdzRL5Tm6hoCkY+LAsY/a8dKEjVOeLssngxmShjzHJKuOL7fH1FTqt95huJNwWuQSV4n4H0epAyRmRnxSIG1iYPugASDiX7kmFyCZZmmABT3Ynrp/FRiwv4elyQ0jErm1aZ53dzIlkkHBjlbp3x5J7eUTV3dQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=HxtzzN6y; arc=none smtp.client-ip=198.175.65.17 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="HxtzzN6y" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1778704432; x=1810240432; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=/VmgYErXZx6IcFJgtwwENbrpk1uKBU0gJ+0pKw7HLsI=; b=HxtzzN6yYMv4gJahnVnBCqIyny1GqUc6DsH6j7hsPtUV2f+SXqpdgqUn 8FTVgTpv6UM3Kre7LsoyFzw1rLWJuhZmvBoCfimafTGx8u4mVL1W/oovX 4fSqDGLfHy4+nmcSIGgZ0QfgkzNCQXQYA9FlUW58TandMVwN+E9fmj+Bv mzxvoMWw+Nd3oLi/HvvABSJnBAYma2T16cG+1lrTQ/FO3hRchx7suiUPt vCZMByTtixFxvyBKu0QTm/BfIr0gB4jvzUaI0+PVUU+EXMXQoRVkykWBW jJ1FbYz2kFqz89k5fQ8iPDNTRvpwuuUuypDG1vmLRSRsni+0K/ugGpwqi Q==; 
X-CSE-ConnectionGUID: fPIq7/RCT7C1BrdnU/01Fw== X-CSE-MsgGUID: U0rh/NKoQq6JD0ccyTCpRg== X-IronPort-AV: E=McAfee;i="6800,10657,11785"; a="79623246" X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="79623246" Received: from orviesa008.jf.intel.com ([10.64.159.148]) by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 May 2026 13:33:47 -0700 X-CSE-ConnectionGUID: dfLlWYxpQoK1Dh6kQ0pFrg== X-CSE-MsgGUID: 2+WaR7QYR+iL6KLdW11vBA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="238076400" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by orviesa008.jf.intel.com with ESMTP; 13 May 2026 13:33:46 -0700 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , Luo Gengkun , linux-kernel@vger.kernel.org Subject: [Patch v4 13/16] sched/cache: Fix cache aware scheduling enabling for multi LLCs system Date: Wed, 13 May 2026 13:39:24 -0700 Message-Id: <6328a8a7f40925cec2a712d81ee58128a4c4444a.1778703694.git.tim.c.chen@linux.intel.com> X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Chen Yu If there are multiple LLCs in the system, cache aware scheduling should be enabled. 
However, there is a corner case: with a single NUMA node and a single
LLC per node, the current implementation still turns cache aware
scheduling on. At this point the parent domain has not yet been
degenerated, so the current domain may have the same CPU span as its
parent. There is no need to turn cache aware scheduling on in this
scenario.

Fix it by iterating over the parent domains to find a domain that is a
superset of the current sd_llc, so that later, after the duplicated
parent domains have been degenerated, cache aware scheduling will take
effect.

For example, the expected behavior would be:

2 sockets, 1 LLC per socket:  MC span=0-3, PKG span=0-7, has_multi_llcs=true
1 socket, 2 LLCs per socket:  MC span=0-3, PKG span=0-7, has_multi_llcs=true
2 sockets, 2 LLCs per socket: MC span=0-3, PKG span=0-7, has_multi_llcs=true
1 socket, 1 LLC per socket:   MC span=0-3, PKG span=0-3, has_multi_llcs=false

This bug was reported by sashiko.

Fixes: d59f4fd1d303 ("sched/cache: Enable cache aware scheduling for multi LLCs NUMA node")
Signed-off-by: Chen Yu
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
---
 kernel/sched/topology.c | 39 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 36 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index cff5a0ecd64d..07f0a3d28253 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1007,6 +1007,37 @@ static bool alloc_sd_llc(const struct cpumask *cpu_map,
 }
 #endif
 
+/*
+ * Return true if @sd belongs to an LLC group whose enclosing
+ * partition spans more than one LLC. @sd must be the topmost
+ * SD_SHARE_LLC domain.
+ *
+ * Any duplicated parent domains with the same span as @sd are
+ * skipped: before cpu_attach_domain() degeneration these still
+ * exist, after degeneration the loop is a no-op. This makes the
+ * helper usable both during sched domain build and against an
+ * already-attached domain tree.
+ *
+ * Note: For systems with a single LLC per node, cache-aware
+ * scheduling is still enabled when multiple nodes exist.
+ * However, NUMA balancing decisions take precedence over
+ * cache-aware scheduling. Conversely, if there is only one
+ * LLC per partition, cache-aware scheduling should be disabled.
+ */
+static bool sd_in_multi_llcs(struct sched_domain *sd)
+{
+	struct sched_domain *sdp = sd->parent;
+
+	/* it does not make sense to aggregate to 1 CPU */
+	if (sd->span_weight == 1)
+		return false;
+
+	while (sdp && sdp->span_weight == sd->span_weight)
+		sdp = sdp->parent;
+
+	return !!sdp;
+}
+
 /*
  * Return the canonical balance CPU for this group, this is the first CPU
  * of this group that's also in the balance mask.
@@ -3016,9 +3047,11 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 			 * NUMA imbalance stats for the hierarchy.
*/ if (sd->parent) { - if (IS_ENABLED(CONFIG_NUMA)) - adjust_numa_imbalance(sd); - has_multi_llcs =3D true; + if (IS_ENABLED(CONFIG_NUMA)) + adjust_numa_imbalance(sd); + + if (sd_in_multi_llcs(sd)) + has_multi_llcs =3D true; } } } --=20 2.32.0 From nobody Fri Jun 12 15:46:56 2026 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7A5803815FE for ; Wed, 13 May 2026 20:33:53 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.17 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704435; cv=none; b=AcATnOnej1q4kulz77PdgRHGEXSYduHsCsYQHDLbLHbaoaFtJQOFbG0kI+Dcsob81koT55Gu10CnH1MieNw502QvAAWoC+nN50PK1OTaH0SYqO3AXNBSgdTOkuscC03F9cEA2AuemBlLazUMitNwnEKuzoM3PPf418UNplMz/JI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704435; c=relaxed/simple; bh=QlxgnIL3wX/PxcACVKd44HmjipBp71zolANDvZekoIE=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=RRmDUhaeb+jVz4din2znAuOM9lT8XQKea3WvkRs136eF2v+ReJRI6wisVngFyAJWEmxAT1hOKHLnJqSgHLLHa1OhDqBbMXwW3s8k183ufS2ts2raefPO7giVE+09FANleSK/Yn/KgSyTmDgHnBbau1fkrw+o4PjSTZmy74LFPH4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=n31hyawj; arc=none smtp.client-ip=198.175.65.17 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="n31hyawj" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; 
d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1778704434; x=1810240434; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=QlxgnIL3wX/PxcACVKd44HmjipBp71zolANDvZekoIE=; b=n31hyawjCEyjnQeSIdB3zMylOziu/2D0ivrATt57OcbTmafOPJ/xLF1E E77gRxXkkIeZdbBo/zQ38RdLnHxiVT6KzESTVfZcy5aWuVPVOW7WpXeNL ti4blsyTfNSCdXJYCoSleyqB0+M/rfdUF5+VmCfojiVGjn4uEoIfEQi/c ObYj75OoVg5m2kYUrqe/pqJ+kCgYWo3nBjodLbmNWxxw5+oqlg+x7X7CI pK65ZbF/AcqBrGXJp0IhZIV/+fi/aXeqibNzHfGrmvxn5C3nwcQunJzeL ZHyzpt+MELUfpwWn58Z88tqkqru0+z0/nahEOj46zUah5W0qkyQaIRwlX g==; X-CSE-ConnectionGUID: 8Hn54XJyQW+XYt7Q68vdWA== X-CSE-MsgGUID: rIT/2kM6R2uYuG2Lo9GmtQ== X-IronPort-AV: E=McAfee;i="6800,10657,11785"; a="79623268" X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="79623268" Received: from orviesa008.jf.intel.com ([10.64.159.148]) by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 May 2026 13:33:48 -0700 X-CSE-ConnectionGUID: UIhc3xJlSWSRdmR/Hmdb5w== X-CSE-MsgGUID: IbITKe/+TgC93+xn0le14Q== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="238076404" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by orviesa008.jf.intel.com with ESMTP; 13 May 2026 13:33:47 -0700 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , Luo Gengkun , linux-kernel@vger.kernel.org Subject: [Patch v4 14/16] sched/cache: Fix has_multi_llcs iff at least one partition has multiple LLCs Date: Wed, 13 May 2026 13:39:25 -0700 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: 
linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Chen Yu

sched_cache_present is a global static key, but build_sched_domains()
is called per partition from the "Build new domains" loop in
partition_sched_domains_locked(). Each call unconditionally sets the
key based solely on the has_multi_llcs local variable for that
partition, so the call for the last partition determines the final
value even when earlier partitions have multiple LLCs.

For example, if partition A (multi-LLC) is built first, the key is
enabled. When partition B (single-LLC) is built afterwards, the key is
disabled, although the multi-LLC partition A is still active.

Fix it in the same way as sched_energy_present: check for multiple
LLCs while iterating over all the partitions rather than on a single
partition.

This bug was reported by sashiko.

Fixes: d59f4fd1d303 ("sched/cache: Enable cache aware scheduling for multi LLCs NUMA node")
Signed-off-by: Chen Yu
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
---
 kernel/sched/topology.c | 69 +++++++++++++++++++++++++++++++----------
 1 file changed, 53 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 07f0a3d28253..4c5ea369d835 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -950,6 +950,7 @@ static void _sched_cache_active_set(void)
 	}
 }
 
+/* used by debugfs */
 void sched_cache_active_set(void)
 {
 	cpus_read_lock();
@@ -999,12 +1000,27 @@ void sched_update_llc_bytes(unsigned int cpu)
 unlock:
 	sched_domains_mutex_unlock();
 }
+
+static void sched_cache_set(bool has_multi_llcs)
+{
+	/*
+	 * TBD: check before writing to it. sched domain rebuild
+	 * is not in the critical path, leave as-is for now.
+	 */
+	if (has_multi_llcs)
+		static_branch_enable_cpuslocked(&sched_cache_present);
+	else
+		static_branch_disable_cpuslocked(&sched_cache_present);
+
+	_sched_cache_active_set();
+}
 #else
 static bool alloc_sd_llc(const struct cpumask *cpu_map,
 			 struct s_data *d)
 {
 	return false;
 }
+static inline void sched_cache_set(bool has_multi_llcs) { }
 #endif
 
 /*
@@ -2949,7 +2965,8 @@ void sched_domains_free_llc_id(int cpu)
  * to the individual CPUs
  */
 static int
-build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
+build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr,
+		    bool *multi_llcs)
 {
 	enum s_alloc alloc_state = sa_none;
 	bool has_multi_llcs = false;
@@ -3093,18 +3110,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 
 	ret = 0;
 error:
-#ifdef CONFIG_SCHED_CACHE
-	/*
-	 * TBD: check before writing to it. sched domain rebuild
-	 * is not in the critical path, leave as-is for now.
-	 */
-	if (!ret && has_multi_llcs)
-		static_branch_enable_cpuslocked(&sched_cache_present);
-	else
-		static_branch_disable_cpuslocked(&sched_cache_present);
-
-	_sched_cache_active_set();
-#endif
+	*multi_llcs = has_multi_llcs;
 	__free_domain_allocs(&d, alloc_state, cpu_map);
 
 	return ret;
@@ -3167,6 +3173,7 @@ void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms)
  */
 int __init sched_init_domains(const struct cpumask *cpu_map)
 {
+	bool multi_llcs;
 	int err;
 
 	zalloc_cpumask_var(&sched_domains_llc_id_allocmask, GFP_KERNEL);
@@ -3181,7 +3188,9 @@ int __init sched_init_domains(const struct cpumask *cpu_map)
 	if (!doms_cur)
 		doms_cur = &fallback_doms;
 	cpumask_and(doms_cur[0], cpu_map, housekeeping_cpumask(HK_TYPE_DOMAIN));
-	err = build_sched_domains(doms_cur[0], NULL);
+	err = build_sched_domains(doms_cur[0], NULL, &multi_llcs);
+	if (!err)
+		sched_cache_set(multi_llcs);
 
 	return err;
 }
@@ -3254,6 +3263,7 @@ static void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new
 				    struct sched_domain_attr *dattr_new)
 {
 	bool __maybe_unused has_eas = false;
+	bool has_multi_llcs = false, multi_llcs;
 	int i, j, n;
 	int new_topology;
 
@@ -3303,14 +3313,41 @@ static void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new
 	for (i = 0; i < ndoms_new; i++) {
 		for (j = 0; j < n && !new_topology; j++) {
 			if (cpumask_equal(doms_new[i], doms_cur[j]) &&
-			    dattrs_equal(dattr_new, i, dattr_cur, j))
+			    dattrs_equal(dattr_new, i, dattr_cur, j)) {
+				/*
+				 * Reused partition has to be taken care
+				 * of here, because there could be a corner
+				 * case that if the reused partition is skipped
+				 * and only new partition is considered, an
+				 * incorrect has_multi_llcs would be set. For
+				 * example:
+				 * If the only multi-LLC partition is reused
+				 * and a new single-LLC partition is built,
+				 * sched_cache_set(false) disables cache-aware
+				 * scheduling globally despite the reused
+				 * multi-LLC partition still being active.
+				 */
+				struct sched_domain *sd;
+				int cpu = cpumask_first(doms_cur[j]);
+
+				guard(rcu)();
+				sd = rcu_dereference(cpu_rq(cpu)->sd);
+				while (sd && sd->parent && (sd->parent->flags & SD_SHARE_LLC))
+					sd = sd->parent;
+				if (sd && (sd->flags & SD_SHARE_LLC) && sd->parent &&
+				    sd_in_multi_llcs(sd))
+					has_multi_llcs = true;
 				goto match2;
+			}
 		}
 		/* No match - add a new doms_new */
-		build_sched_domains(doms_new[i], dattr_new ? dattr_new + i : NULL);
+		build_sched_domains(doms_new[i], dattr_new ?
dattr_new + i : NULL, + &multi_llcs); + has_multi_llcs |=3D multi_llcs; match2: ; } + sched_cache_set(has_multi_llcs); =20 #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) /* Build perf domains: */ --=20 2.32.0 From nobody Fri Jun 12 15:46:56 2026 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C9077360EED for ; Wed, 13 May 2026 20:33:53 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.17 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704435; cv=none; b=O5FNa5hCMA6Kwglh7mWRdiVljBWi+1VPZakMOodIz5vPHbFIsKl5OVR4g9NDrBieWfsmu57yunMMni0UDISLI1/T7hdm6jKJdg0/qwByrJiNw6LzkGNmmopP9TgmotsA9oFqLmolcAESMMbWBaFPi9b55TCcvoeeomaZ7zm9+JA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704435; c=relaxed/simple; bh=t1FwK5H8eA5XceraosnpeVkGplWnliav65lNNIbDDfU=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=qll82RvDi6W+m6eZBwh+Jw/k+zZZFZx4CHmwexgD9MvflV+bSViGh3Bo2qS+Pto41JI3yGuHJkiHjq2jMYcByxPm0ANWmP8tcbWnnING202ruEeOk0XQFe7qvY8wysDEoC4e8ebmc8P/BXutfVXFGuLhvgUQi9Wuc22EzGbm+p4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=i0LZbcXM; arc=none smtp.client-ip=198.175.65.17 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="i0LZbcXM" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; 
i=@intel.com; q=dns/txt; s=Intel; t=1778704434; x=1810240434; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=t1FwK5H8eA5XceraosnpeVkGplWnliav65lNNIbDDfU=; b=i0LZbcXMHI8g/H89E89jOLAe2N5P3uMI7eZV+dXc06+TfeHo+fdjN4i5 GyjEJUr9XC7ZzRRRJc4wfSb8jA6qmm1Ptwx219yDHjs41fTv/8+EwbB4m ewbIsAmWSLEj6iLRDKOaL2PjWILspb//uIeqRoyBSuE1UlB3UmrC12osI Pa0oNGDxzUTe0Q9IqsEvMad/Jl30GxtJmj/vrxmUTVOh4MtvbitY9YulT NSG9kBkbYGHU0usp5mY+U2WoQ9O8xMQk761dI+EYlGe/POnS8UqH2CI53 fcOOUvUvHxKmSEWEClgNKUHe5/BOo4YMoZLhfITMGrBev49mGahVHSIGR w==; X-CSE-ConnectionGUID: nImrx8VQSEyknvWWV0vYhw== X-CSE-MsgGUID: L7hL1qc7SWuUGXZdmoqH6g== X-IronPort-AV: E=McAfee;i="6800,10657,11785"; a="79623292" X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="79623292" Received: from orviesa008.jf.intel.com ([10.64.159.148]) by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 May 2026 13:33:49 -0700 X-CSE-ConnectionGUID: ww2XjdZVQwGP/xMBgK9hIw== X-CSE-MsgGUID: 7sZq6untS8OBizL5JkqDVg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="238076408" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by orviesa008.jf.intel.com with ESMTP; 13 May 2026 13:33:48 -0700 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , Luo Gengkun , linux-kernel@vger.kernel.org Subject: [Patch v4 15/16] sched/cache: Fix possible overflow when invalidating the preferred CPU Date: Wed, 13 May 2026 13:39:26 -0700 Message-Id: X-Mailer: git-send-email 2.32.0 In-Reply-To: References: Precedence: bulk X-Mailing-List: 
linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Chen Yu

epoch comes from the local rq->cpu_epoch, but mm->sc_stat.epoch is
written by task_tick_cache() running on any CPU, potentially a
different CPU whose rq->cpu_epoch is further ahead. In that case the
unsigned subtraction epoch - mm->sc_stat.epoch underflows and wraps to
a huge number, so the timeout condition fires incorrectly.

Fix this by converting the result to long, so that a stored epoch that
is ahead of the local one yields a negative difference.

Fixes: df0d98475954 ("sched/cache: Introduce infrastructure for cache-aware load balancing")
Signed-off-by: Chen Yu
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8617cd3642c7..7e64cd18727e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1688,7 +1688,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	 * If this process hasn't hit task_cache_work() for a while invalidate
 	 * its preferred state.
*/ - if (epoch - READ_ONCE(mm->sc_stat.epoch) > llc_epoch_affinity_timeout || + if ((long)(epoch - READ_ONCE(mm->sc_stat.epoch)) > (long)llc_epoch_affini= ty_timeout || invalid_llc_nr(mm, p, cpu_of(rq)) || exceed_llc_capacity(mm, cpu_of(rq))) { if (READ_ONCE(mm->sc_stat.cpu) !=3D -1) --=20 2.32.0 From nobody Fri Jun 12 15:46:56 2026 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 74A8E3955F3 for ; Wed, 13 May 2026 20:33:55 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.17 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704443; cv=none; b=BJrRr0Xky1aNGzxswGos20DZEEHt281hTIYg9J2DRcJbOLoLZGwW+K9Yr/IYbYlZNegXqtpUZkFvcvMLtVP8lKCQ3qg2tgAO7sZMhiG2wIuq3grdTfmD/jJBgkiL/Ig2j+WazeI3V6SBWnhhU6zgvjPKTfDv/UOjfwPD0WOFE1Y= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778704443; c=relaxed/simple; bh=LeKY5OSGAFQHBTheCebjoiSuXh/qiznyE2rafuiDcYw=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=QhxHfofduPQAkoaV9niUxesWLiiUWdMLFVYMz7EKCKPcpbpxLR/3du0oHpgwXbZWTDQUVvNaGuVrhGfHBwajOXT/dM4+bnXg1lV9KQvVz46X9iCz1XbKQlJ74c3tk9nn7NJdhw4Zr7i6Uw4LzNxidT5nUvto8d2PKZM8TyaxTMM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=pass smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=csQXKvMv; arc=none smtp.client-ip=198.175.65.17 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="csQXKvMv" 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1778704436; x=1810240436; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=LeKY5OSGAFQHBTheCebjoiSuXh/qiznyE2rafuiDcYw=; b=csQXKvMvmMuUz8TLTse9EX4mJtBJauaFn/g9ugXIwzCHdktO3dxSduVi hpSsvnpNd1LZEC3HDwa+iuMydO8BuQ6vMWrCF+XnFoKfQ8iDj/nl+1iFw /rquPd9Ywmy0JvBuq+U0E1bm24Ql76gENuxOzaKYl+mO1JuvQvjqMjpUz H3wSpfVnQJ14Hat7nmiGWt0TyNnkL2kKjWQGCLJizAxXL/jtXBMS4jzt5 n/ou9f8sa8egayIgyUmohFpsofMr5hjSZCIiGoE6u5p0YqcaGZm+krebl WVoOi7R7G6jOajV3POEcNtX7ootZPYWRorIjCY+VCFoFvI+qg35tS/fpe g==; X-CSE-ConnectionGUID: QK1cemjYRhOV+auh2VIN9w== X-CSE-MsgGUID: aQCTKQ6rRsyk/RkU3jUmqA== X-IronPort-AV: E=McAfee;i="6800,10657,11785"; a="79623313" X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="79623313" Received: from orviesa008.jf.intel.com ([10.64.159.148]) by orvoesa109.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 May 2026 13:33:50 -0700 X-CSE-ConnectionGUID: /TZAYCgTSkmniKeQefjO/Q== X-CSE-MsgGUID: LdA4D9VaQPai4TkFq4gShQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.23,233,1770624000"; d="scan'208";a="238076412" Received: from b04f130c83f2.jf.intel.com ([10.165.154.98]) by orviesa008.jf.intel.com with ESMTP; 13 May 2026 13:33:50 -0700 From: Tim Chen To: Peter Zijlstra , Ingo Molnar , K Prateek Nayak , Vincent Guittot Cc: Chen Yu , Juri Lelli , Dietmar Eggemann , Steven Rostedt , Ben Segall , Mel Gorman , Valentin Schneider , Madadi Vineeth Reddy , Hillf Danton , Shrikanth Hegde , Jianyong Wu , Yangyu Chen , Tingyin Duan , Vern Hao , Vern Hao , Len Brown , Tim Chen , Aubrey Li , Zhao Liu , Chen Yu , Adam Li , Aaron Lu , Tim Chen , Josh Don , Gavin Guo , Qais Yousef , Libo Chen , Luo Gengkun , linux-kernel@vger.kernel.org Subject: [Patch v4 16/16] sched/cache: Fix stale preferred_llc for a new task Date: Wed, 13 May 2026 13:39:27 -0700 Message-Id: 
<0ec7309d0e24ede97656754d1505b7490403d966.1778703694.git.tim.c.chen@linux.intel.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To:
References:
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Chen Yu

On fork without CLONE_VM, the child gets a new mm, so the
preferred_llc value inherited from the parent is stale for the child.
Fix this by resetting the task's preferred_llc to -1.

This bug was reported by sashiko.

Fixes: 47d8696b95f7 ("sched/cache: Assign preferred LLC ID to processes")
Signed-off-by: Chen Yu
Co-developed-by: Tim Chen
Signed-off-by: Tim Chen
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e64cd18727e..73da6f8fc9ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1914,6 +1914,11 @@ void init_sched_mm(struct task_struct *p)
 
 	init_task_work(work, task_cache_work);
 	work->next = work;
+	/*
+	 * Reset new task's preference to avoid
+	 * polluting account_llc_enqueue().
+	 */
+	p->preferred_llc = -1;
 }
 
 #else /* CONFIG_SCHED_CACHE */
-- 
2.32.0
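
Editorial note on patch 15/16 (not part of the series): the effect of
the (long) cast on the epoch comparison can be reproduced in userspace.
The sketch below is only an illustration under stated assumptions. It
assumes both epochs are unsigned long, as the cast in the patch
suggests. EPOCH_TIMEOUT and the helper names are hypothetical stand-ins
and do not exist in the kernel, and the timeout value is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for llc_epoch_affinity_timeout; value is arbitrary. */
#define EPOCH_TIMEOUT 50UL

/*
 * Comparison as written before the fix: if 'stored' is ahead of 'local',
 * the unsigned difference wraps to a huge value and the test is true.
 */
static inline bool timed_out_unsigned(unsigned long local, unsigned long stored)
{
	return local - stored > EPOCH_TIMEOUT;
}

/*
 * Comparison as written after the fix: the difference is reinterpreted as
 * signed, so a 'stored' epoch that is slightly ahead gives a small negative
 * number. This relies on two's complement conversion, which is what the
 * compilers used for the kernel provide.
 */
static inline bool timed_out_signed(unsigned long local, unsigned long stored)
{
	return (long)(local - stored) > (long)EPOCH_TIMEOUT;
}

/* Small self-check covering the three interesting cases. */
static inline void epoch_selfcheck(void)
{
	/* stored epoch 3 ahead of local: spurious timeout before the fix */
	assert(timed_out_unsigned(100UL, 103UL));
	assert(!timed_out_signed(100UL, 103UL));
	/* stored epoch 100 behind local: a genuine timeout either way */
	assert(timed_out_unsigned(200UL, 100UL));
	assert(timed_out_signed(200UL, 100UL));
	/* stored epoch 20 behind local: within the timeout either way */
	assert(!timed_out_unsigned(120UL, 100UL));
	assert(!timed_out_signed(120UL, 100UL));
}
```

Calling epoch_selfcheck() from a test main() exercises all three cases.
Only the first case differs between the two helpers, which matches the
scenario described in the commit message of patch 15/16.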