[PATCH v2 21/23] -- DO NOT APPLY!!! -- sched/cache/stats: Add schedstat for cache aware load balancing

Posted by Tim Chen 2 weeks, 1 day ago
From: Chen Yu <yu.c.chen@intel.com>

Debug patch only.

With cache-aware load balancing enabled, statistics related to its activity
are exposed via /proc/schedstat and debugfs. For instance, users who want to
check metrics such as the number of times the RSS and nr_running limits were
exceeded can filter the output of /sys/kernel/debug/sched/debug and compute
the required statistics manually:

llc_exceed_cap SUM: 6
llc_exceed_nr SUM: 4531

Furthermore, the statistics exposed in /proc/schedstat can be queried manually
or via perf sched stats [1] with minor modifications.
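
As an illustration only (not part of this patch's code), a user-space tool
could sum the new per-domain counter from the version-18 format with a sketch
like the one below. The offsets assume the seq_printf() order above, with 12
load-balance counters per idle type and CPU_MAX_IDLE_TYPES == 3 idle types;
treat them as assumptions of the sketch, not a stable ABI:

/* Illustrative sketch: sum lb_imbalance_llc across all domains and idle
 * types, assuming the version-18 /proc/schedstat layout added by this
 * patch (12 load-balance counters per idle type, lb_imbalance_llc being
 * the 8th, right after lb_imbalance_misfit).
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NR_IDLE_TYPES	3	/* CPU_MAX_IDLE_TYPES */
#define LB_FIELDS	12	/* per idle type, schedstat version 18 */
#define LLC_IDX		7	/* 0-based position of lb_imbalance_llc */

int main(void)
{
	char line[4096];
	unsigned long long llc_sum = 0;
	FILE *fp = fopen("/proc/schedstat", "r");

	if (!fp)
		return 1;

	while (fgets(line, sizeof(line), fp)) {
		char *tok;

		if (strncmp(line, "domain", 6))
			continue;

		/* skip "domainN", the domain name and the cpumask */
		strtok(line, " ");
		strtok(NULL, " ");
		strtok(NULL, " ");

		for (int i = 0; i < NR_IDLE_TYPES; i++) {
			for (int j = 0; j < LB_FIELDS; j++) {
				tok = strtok(NULL, " ");
				if (!tok)
					goto next_line;
				if (j == LLC_IDX)
					llc_sum += strtoull(tok, NULL, 10);
			}
		}
next_line:
		;
	}
	fclose(fp);

	printf("lb_imbalance_llc SUM: %llu\n", llc_sum);
	return 0;
}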

Link: https://lore.kernel.org/all/20250909114227.58802-1-swapnil.sapkal@amd.com #1

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 include/linux/sched/topology.h | 1 +
 kernel/sched/fair.c            | 1 +
 kernel/sched/stats.c           | 5 +++--
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 0ba4697d74ba..8702c1e731a0 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -108,6 +108,7 @@ struct sched_domain {
 	unsigned int lb_imbalance_util[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_imbalance_task[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_imbalance_misfit[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_imbalance_llc[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_gained[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a2e2d6742481..742e455b093e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12684,6 +12684,7 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
 		__schedstat_add(sd->lb_imbalance_misfit[idle], env->imbalance);
 		break;
 	case migrate_llc_task:
+		__schedstat_add(sd->lb_imbalance_llc[idle], env->imbalance);
 		break;
 	}
 }
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index d1c9429a4ac5..3736f6102261 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -104,7 +104,7 @@ void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
  * Bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 17
+#define SCHEDSTAT_VERSION 18
 
 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -139,7 +139,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
 			seq_printf(seq, "domain%d %s %*pb", dcount++, sd->name,
 				   cpumask_pr_args(sched_domain_span(sd)));
 			for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) {
-				seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u",
+				seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u %u",
 				    sd->lb_count[itype],
 				    sd->lb_balanced[itype],
 				    sd->lb_failed[itype],
@@ -147,6 +147,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
 				    sd->lb_imbalance_util[itype],
 				    sd->lb_imbalance_task[itype],
 				    sd->lb_imbalance_misfit[itype],
+				    sd->lb_imbalance_llc[itype],
 				    sd->lb_gained[itype],
 				    sd->lb_hot_gained[itype],
 				    sd->lb_nobusyq[itype],
-- 
2.32.0
Re: [PATCH v2 21/23] -- DO NOT APPLY!!! -- sched/cache/stats: Add schedstat for cache aware load balancing
Posted by Yangyu Chen 14 hours ago
> On 4 Dec 2025, at 07:07, Tim Chen <tim.c.chen@linux.intel.com> wrote:
> 
> From: Chen Yu <yu.c.chen@intel.com>
> 
> Debug patch only.
> 
> With cache-aware load balancing enabled, statistics related to its activity
> are exposed via /proc/schedstat and debugfs. For instance, users who want to
> check metrics such as the number of times the RSS and nr_running limits were
> exceeded can filter the output of /sys/kernel/debug/sched/debug and compute
> the required statistics manually:
> 
> llc_exceed_cap SUM: 6
> llc_exceed_nr SUM: 4531
> 
> Furthermore, the statistics exposed in /proc/schedstat can be queried manually
> or via perf sched stats [1] with minor modifications.
> 

Hi Tim,

This patch looks great, especially for multithreaded Verilator workloads
on clustered-LLC systems (like AMD EPYC). I'm discussing with Verilator
upstream whether to disable Verilator's automatic userspace affinity
assignment when such a kernel feature exists [1]. During that discussion,
it became clear that userspace software needs a way to detect whether the
feature is present. Could we expose it in `/proc/schedstat` to allow
userspace software to detect it? We could just use this patch and remove
the "DO NOT APPLY" tag.
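
As a rough sketch of what I mean (assuming the version bump to 18 from this
patch would be the signal), detection from user space could look something
like the snippet below. Relying on the raw version number is admittedly
fragile, since it also changes for unrelated format updates, which is part
of why a dedicated indication would be nicer:

/* Sketch only: treat a schedstat version >= 18 as "cache-aware load
 * balancing stats are present".  Fragile, because the version also bumps
 * for unrelated format changes.
 */
#include <stdio.h>

static int has_cache_aware_lb_stats(void)
{
	unsigned int ver = 0;
	FILE *fp = fopen("/proc/schedstat", "r");

	if (!fp)
		return 0;
	if (fscanf(fp, "version %u", &ver) != 1)
		ver = 0;
	fclose(fp);

	return ver >= 18;
}

int main(void)
{
	printf("cache-aware LB stats %savailable\n",
	       has_cache_aware_lb_stats() ? "" : "not ");
	return 0;
}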

[1] https://github.com/verilator/verilator/issues/6826#issuecomment-3671287551

Thanks,
Yangyu Chen

Re: [PATCH v2 21/23] -- DO NOT APPLY!!! -- sched/cache/stats: Add schedstat for cache aware load balancing
Posted by Chen, Yu C 4 hours ago
On 12/19/2025 1:03 PM, Yangyu Chen wrote:
> Hi Tim,
> 
> This patch looks great, especially for multithreaded Verilator workloads
> on clustered-LLC systems (like AMD EPYC). I'm discussing with Verilator
> upstream whether to disable Verilator's automatic userspace affinity
> assignment when such a kernel feature exists [1]. During that discussion,
> it became clear that userspace software needs a way to detect whether the
> feature is present. Could we expose it in `/proc/schedstat` to allow
> userspace software to detect it? We could just use this patch and remove
> the "DO NOT APPLY" tag.
> 

Thanks for the test, Yangyu. Does /sys/kernel/debug/sched/llc_enabled
work for you?
Anyway, we can try to include /proc/schedstat as the formal interface
in the next version.

Thanks,
Chenyu

> [1] https://github.com/verilator/verilator/issues/6826#issuecomment-3671287551
> 
> Thanks,
> Yangyu Chen
Re: [PATCH v2 21/23] -- DO NOT APPLY!!! -- sched/cache/stats: Add schedstat for cache aware load balancing
Posted by Yangyu Chen 4 hours ago

> On 19 Dec 2025, at 22:41, Chen, Yu C <yu.c.chen@intel.com> wrote:
> 
> Thanks for the test Yangyu. Does /sys/kernel/debug/sched/llc_enabled
> work for you?

It requires debugfs to be mounted with sufficient permissions, which isn't
feasible for normal user-space software running without root privileges.

Thanks,
Yangyu Chen
