[RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated

Chen Yu posted 5 patches 7 months, 3 weeks ago
It is found that when a process's preferred LLC gets saturated by too many
threads, task contention becomes very frequent and causes performance
regressions.

Save the per-LLC statistics calculated by periodic load balance. The statistics
include the average utilization and the average number of runnable tasks.
The task wakeup path for cache aware scheduling consults these statistics
to decide whether cache aware scheduling should be inhibited, to avoid a
performance regression. When either the average utilization of the preferred
LLC has reached 25% of its capacity, or the average number of runnable tasks
has reached 1/3 of the LLC weight, the cache aware wakeup is disabled. This
restriction applies only when the process has at least as many threads as
the LLC weight.
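
In essence, the wakeup-time gate reduces to the following predicate (a
simplified sketch of valid_target_cpu() introduced below; llc_wakeup_allowed
is a made-up name, and SCHED_CAPACITY_SCALE is 1024):

static bool llc_wakeup_allowed(int nr_threads, int nr_running,
			       int llc_weight, unsigned long util)
{
	/* small processes always get the cache aware wakeup */
	if (nr_threads < llc_weight)
		return true;
	/* average utilization has reached 25% of the LLC capacity */
	if (util * 4 >= (unsigned long)llc_weight * 1024)
		return false;
	/* average runnable tasks have reached 1/3 of the LLC CPUs */
	if (nr_running * 3 >= llc_weight)
		return false;
	return true;
}

For an LLC with 32 CPUs, for example, the cache aware wakeup would be
suppressed once the average utilization reaches 32 * 1024 / 4 = 8192, or
once the average number of runnable tasks reaches 11 (11 * 3 >= 32).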

Running schbench via mmtests on a Xeon platform that has 2 sockets, each with
60 cores/120 CPUs. DRAM interleaving is enabled across NUMA nodes via the
BIOS, so there are 2 "LLCs" in 1 NUMA node.

compare-mmtests.pl --directory work/log --benchmark schbench --names baseline,sched_cache
                                   baseline            sched_cache
Lat 50.0th-qrtle-1          6.00 (   0.00%)        6.00 (   0.00%)
Lat 90.0th-qrtle-1         10.00 (   0.00%)        9.00 (  10.00%)
Lat 99.0th-qrtle-1         29.00 (   0.00%)       13.00 (  55.17%)
Lat 99.9th-qrtle-1         35.00 (   0.00%)       21.00 (  40.00%)
Lat 20.0th-qrtle-1        266.00 (   0.00%)      266.00 (   0.00%)
Lat 50.0th-qrtle-2          8.00 (   0.00%)        6.00 (  25.00%)
Lat 90.0th-qrtle-2         10.00 (   0.00%)       10.00 (   0.00%)
Lat 99.0th-qrtle-2         19.00 (   0.00%)       18.00 (   5.26%)
Lat 99.9th-qrtle-2         27.00 (   0.00%)       29.00 (  -7.41%)
Lat 20.0th-qrtle-2        533.00 (   0.00%)      507.00 (   4.88%)
Lat 50.0th-qrtle-4          6.00 (   0.00%)        5.00 (  16.67%)
Lat 90.0th-qrtle-4          8.00 (   0.00%)        5.00 (  37.50%)
Lat 99.0th-qrtle-4         14.00 (   0.00%)        9.00 (  35.71%)
Lat 99.9th-qrtle-4         22.00 (   0.00%)       14.00 (  36.36%)
Lat 20.0th-qrtle-4       1070.00 (   0.00%)      995.00 (   7.01%)
Lat 50.0th-qrtle-8          5.00 (   0.00%)        5.00 (   0.00%)
Lat 90.0th-qrtle-8          7.00 (   0.00%)        5.00 (  28.57%)
Lat 99.0th-qrtle-8         12.00 (   0.00%)       11.00 (   8.33%)
Lat 99.9th-qrtle-8         19.00 (   0.00%)       16.00 (  15.79%)
Lat 20.0th-qrtle-8       2140.00 (   0.00%)     2140.00 (   0.00%)
Lat 50.0th-qrtle-16         6.00 (   0.00%)        5.00 (  16.67%)
Lat 90.0th-qrtle-16         7.00 (   0.00%)        5.00 (  28.57%)
Lat 99.0th-qrtle-16        12.00 (   0.00%)       10.00 (  16.67%)
Lat 99.9th-qrtle-16        17.00 (   0.00%)       14.00 (  17.65%)
Lat 20.0th-qrtle-16      4296.00 (   0.00%)     4200.00 (   2.23%)
Lat 50.0th-qrtle-32         6.00 (   0.00%)        5.00 (  16.67%)
Lat 90.0th-qrtle-32         8.00 (   0.00%)        6.00 (  25.00%)
Lat 99.0th-qrtle-32        12.00 (   0.00%)       10.00 (  16.67%)
Lat 99.9th-qrtle-32        17.00 (   0.00%)       14.00 (  17.65%)
Lat 20.0th-qrtle-32      8496.00 (   0.00%)     8528.00 (  -0.38%)
Lat 50.0th-qrtle-64         6.00 (   0.00%)        5.00 (  16.67%)
Lat 90.0th-qrtle-64         8.00 (   0.00%)        8.00 (   0.00%)
Lat 99.0th-qrtle-64        12.00 (   0.00%)       12.00 (   0.00%)
Lat 99.9th-qrtle-64        17.00 (   0.00%)       17.00 (   0.00%)
Lat 20.0th-qrtle-64     17120.00 (   0.00%)    17120.00 (   0.00%)
Lat 50.0th-qrtle-128        7.00 (   0.00%)        7.00 (   0.00%)
Lat 90.0th-qrtle-128        9.00 (   0.00%)        9.00 (   0.00%)
Lat 99.0th-qrtle-128       13.00 (   0.00%)       14.00 (  -7.69%)
Lat 99.9th-qrtle-128       20.00 (   0.00%)       20.00 (   0.00%)
Lat 20.0th-qrtle-128    31776.00 (   0.00%)    30496.00 (   4.03%)
Lat 50.0th-qrtle-239        9.00 (   0.00%)        9.00 (   0.00%)
Lat 90.0th-qrtle-239       14.00 (   0.00%)       18.00 ( -28.57%)
Lat 99.0th-qrtle-239       43.00 (   0.00%)       56.00 ( -30.23%)
Lat 99.9th-qrtle-239      106.00 (   0.00%)      483.00 (-355.66%)
Lat 20.0th-qrtle-239    30176.00 (   0.00%)    29984.00 (   0.64%)

We can see overall latency improvements, with some tail latency degradation
when the system gets saturated.

We also ran schbench (old version) on an EPYC 7543 system, which has
4 NUMA nodes, each with 4 LLCs, and monitored the 99.0th percentile latency:

case                    load            baseline(std%)  compare%( std%)
normal                  4-mthreads-1-workers     1.00 (  6.47)   +9.02 (  4.68)
normal                  4-mthreads-2-workers     1.00 (  3.25)  +28.03 (  8.76)
normal                  4-mthreads-4-workers     1.00 (  6.67)   -4.32 (  2.58)
normal                  4-mthreads-8-workers     1.00 (  2.38)   +1.27 (  2.41)
normal                  4-mthreads-16-workers    1.00 (  5.61)   -8.48 (  4.39)
normal                  4-mthreads-31-workers    1.00 (  9.31)   -0.22 (  9.77)

When the LLC is underloaded, latency improvements are observed. When the LLC
gets saturated, we observe some degradation.

The aggregation of tasks moves them towards the preferred LLC pretty
quickly during wakeups. However, load balance will tend to move tasks
away from the aggregated LLC. The two migrations are in opposite
directions and tend to bounce tasks between LLCs. Such task migrations
should be impeded in load balancing as long as the home LLC is not
overloaded. We're working on fixing up the load balancing path to
address such issues.

Co-developed-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/sched/topology.h |   4 ++
 kernel/sched/fair.c            | 101 ++++++++++++++++++++++++++++++++-
 2 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 198bb5cc1774..9625d9d762f5 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -78,6 +78,10 @@ struct sched_domain_shared {
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
 	int		nr_idle_scan;
+#ifdef CONFIG_SCHED_CACHE
+	unsigned long	util_avg;
+	u64		nr_avg;
+#endif
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1733eb83042c..f74d8773c811 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8791,6 +8791,58 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 #ifdef CONFIG_SCHED_CACHE
 static long __migrate_degrades_locality(struct task_struct *p, int src_cpu, int dst_cpu, bool idle);
 
+/* expected to be protected by rcu_read_lock() */
+static bool get_llc_stats(int cpu, int *nr, int *weight, unsigned long *util)
+{
+	struct sched_domain_shared *sd_share;
+
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+	if (!sd_share)
+		return false;
+
+	*nr = READ_ONCE(sd_share->nr_avg);
+	*util = READ_ONCE(sd_share->util_avg);
+	*weight = per_cpu(sd_llc_size, cpu);
+
+	return true;
+}
+
+static bool valid_target_cpu(int cpu, struct task_struct *p)
+{
+	int nr_running, llc_weight;
+	unsigned long util, llc_cap;
+
+	if (!get_llc_stats(cpu, &nr_running, &llc_weight,
+			   &util))
+		return false;
+
+	llc_cap = llc_weight * SCHED_CAPACITY_SCALE;
+
+	/*
+	 * If this process has many threads, be careful to avoid
+	 * task stacking on the preferred LLC, by checking the LLC's
+	 * utilization and number of runnable tasks. Otherwise, if
+	 * this process does not have many threads, honor the cache
+	 * aware wakeup.
+	 */
+	if (get_nr_threads(p) < llc_weight)
+		return true;
+
+	/*
+	 * Check if the average utilization exceeds 25% of the LLC
+	 * capacity, or if the number of runnable tasks reaches 1/3
+	 * of the CPUs. These are magic numbers that did not cause
+	 * heavy cache contention on Xeon or Zen.
+	 */
+	if (util * 4 >= llc_cap)
+		return false;
+
+	if (nr_running * 3 >= llc_weight)
+		return false;
+
+	return true;
+}
+
 static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 {
 	struct mm_struct *mm = p->mm;
@@ -8813,6 +8865,9 @@ static int select_cache_cpu(struct task_struct *p, int prev_cpu)
 	if (cpus_share_cache(prev_cpu, cpu))
 		return prev_cpu;
 
+	if (!valid_target_cpu(cpu, p))
+		return prev_cpu;
+
 	if (static_branch_likely(&sched_numa_balancing) &&
 	    __migrate_degrades_locality(p, prev_cpu, cpu, false) > 0) {
 		/*
@@ -9564,7 +9619,8 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 	 */
 	if (sched_feat(SCHED_CACHE) && p->mm && p->mm->mm_sched_cpu >= 0 &&
 	    cpus_share_cache(env->src_cpu, p->mm->mm_sched_cpu) &&
-	    !cpus_share_cache(env->src_cpu, env->dst_cpu))
+	    !cpus_share_cache(env->src_cpu, env->dst_cpu) &&
+	    !valid_target_cpu(env->dst_cpu, p))
 		return 1;
 #endif
 
@@ -10634,6 +10690,48 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 	return check_cpu_capacity(rq, sd);
 }
 
+#ifdef CONFIG_SCHED_CACHE
+/*
+ * Save this sched group's statistics for later use:
+ * the task wakeup and load balance paths can make
+ * better decisions based on these statistics.
+ */
+static void update_sg_if_llc(struct lb_env *env, struct sg_lb_stats *sgs,
+			     struct sched_group *group)
+{
+	/* Find the sched domain that spans this group. */
+	struct sched_domain *sd = env->sd->child;
+	struct sched_domain_shared *sd_share;
+	u64 last_nr;
+
+	if (!sched_feat(SCHED_CACHE) || env->idle == CPU_NEWLY_IDLE)
+		return;
+
+	/* only care about the sched domain that spans exactly 1 LLC */
+	if (!sd || !(sd->flags & SD_SHARE_LLC) ||
+	    !sd->parent || (sd->parent->flags & SD_SHARE_LLC))
+		return;
+
+	sd_share = rcu_dereference(per_cpu(sd_llc_shared,
+				   cpumask_first(sched_group_span(group))));
+	if (!sd_share)
+		return;
+
+	last_nr = READ_ONCE(sd_share->nr_avg);
+	update_avg(&last_nr, sgs->sum_nr_running);
+
+	if (likely(READ_ONCE(sd_share->util_avg) != sgs->group_util))
+		WRITE_ONCE(sd_share->util_avg, sgs->group_util);
+
+	WRITE_ONCE(sd_share->nr_avg, last_nr);
+}
+#else
+static inline void update_sg_if_llc(struct lb_env *env, struct sg_lb_stats *sgs,
+				    struct sched_group *group)
+{
+}
+#endif
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -10723,6 +10821,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
+	update_sg_if_llc(env, sgs, group);
 	/* Computing avg_load makes sense only when group is overloaded */
 	if (sgs->group_type == group_overloaded)
 		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
-- 
2.25.1
Re: [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated
Posted by Madadi Vineeth Reddy 7 months, 3 weeks ago
Hi Chen Yu,

On 21/04/25 08:55, Chen Yu wrote:
> [..snip..]

> +static bool valid_target_cpu(int cpu, struct task_struct *p)
> +{
> +	int nr_running, llc_weight;
> +	unsigned long util, llc_cap;
> +
> +	if (!get_llc_stats(cpu, &nr_running, &llc_weight,
> +			   &util))
> +		return false;
> +
> +	llc_cap = llc_weight * SCHED_CAPACITY_SCALE;
> +
> +	/*
> +	 * If this process has many threads, be careful to avoid
> +	 * task stacking on the preferred LLC, by checking the LLC's
> +	 * utilization and number of runnable tasks. Otherwise, if
> +	 * this process does not have many threads, honor the cache
> +	 * aware wakeup.
> +	 */
> +	if (get_nr_threads(p) < llc_weight)
> +		return true;

IIUC, there might be scenarios where the LLC is already overloaded with
threads of another process. In that case, we will return true for p in the
above condition and not check the below conditions. Shouldn't we check
the below two conditions either way?

Tested this patch with the real-life workload Daytrader and didn't see any
regression. It spawns a lot of threads and is CPU intensive, so I think it's
not impacted by the below conditions.

Also, in the schbench numbers you provided, there is a degradation in the
saturated case. Is it due to the overhead of computing the preferred LLC,
which then goes unused because of the below conditions?

Thanks,
Madadi Vineeth Reddy

> +
> +	/*
> +	 * Check if the average utilization exceeds 25% of the LLC
> +	 * capacity, or if the number of runnable tasks reaches 1/3
> +	 * of the CPUs. These are magic numbers that did not cause
> +	 * heavy cache contention on Xeon or Zen.
> +	 */
> +	if (util * 4 >= llc_cap)
> +		return false;
> +
> +	if (nr_running * 3 >= llc_weight)
> +		return false;
> +
> +	return true;
> +}
> +

[..snip..]
Re: [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated
Posted by Chen, Yu C 7 months, 3 weeks ago
Hi Madadi,

On 4/24/2025 5:22 PM, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
> 
> On 21/04/25 08:55, Chen Yu wrote:
>> [..snip..]
> 
>> +static bool valid_target_cpu(int cpu, struct task_struct *p)
>> +{
>> +	int nr_running, llc_weight;
>> +	unsigned long util, llc_cap;
>> +
>> +	if (!get_llc_stats(cpu, &nr_running, &llc_weight,
>> +			   &util))
>> +		return false;
>> +
>> +	llc_cap = llc_weight * SCHED_CAPACITY_SCALE;
>> +
>> +	/*
>> +	 * If this process has many threads, be careful to avoid
>> +	 * task stacking on the preferred LLC, by checking the LLC's
>> +	 * utilization and number of runnable tasks. Otherwise, if
>> +	 * this process does not have many threads, honor the cache
>> +	 * aware wakeup.
>> +	 */
>> +	if (get_nr_threads(p) < llc_weight)
>> +		return true;
> 
> IIUC, there might be scenarios where the LLC is already overloaded with
> threads of another process. In that case, we will return true for p in the
> above condition and not check the below conditions. Shouldn't we check
> the below two conditions either way?

The reason why get_nr_threads() was used is that we don't know if the
following thresholds are suitable for different workloads. We chose 25%
and 33% because we found that they worked well for workload A, but were
too low for workload B. Workload B requires cache-aware scheduling to be
enabled in any case, and the number of threads in B is smaller than the
llc_weight. Therefore, we use the above check to meet the requirements
of B. What you said is correct. We can remove the above check on
nr_threads and make the combination of utilization and nr_running a
mandatory check, and then conduct further tuning.
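
Something like the following untested sketch (just dropping the early
return on get_nr_threads()):

static bool valid_target_cpu(int cpu, struct task_struct *p)
{
	int nr_running, llc_weight;
	unsigned long util, llc_cap;

	if (!get_llc_stats(cpu, &nr_running, &llc_weight, &util))
		return false;

	llc_cap = llc_weight * SCHED_CAPACITY_SCALE;

	/* make the utilization and runnable checks mandatory */
	if (util * 4 >= llc_cap)
		return false;

	if (nr_running * 3 >= llc_weight)
		return false;

	return true;
}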
> Tested this patch with the real-life workload Daytrader and didn't see any
> regression.

Good to know the regression is gone.

> It spawns a lot of threads and is CPU intensive, so I think it's not
> impacted by the below conditions.
> 
> Also, in the schbench numbers you provided, there is a degradation in the
> saturated case. Is it due to the overhead of computing the preferred LLC,
> which then goes unused because of the below conditions?

Yes, the overhead of the preferred LLC calculation could be one part, and
we also suspect that the degradation might be tied to the task migrations.
We still observed more task migrations than the baseline, even when the
system was saturated (in theory, once the 25% threshold is exceeded, we
should fall back to the generic task wakeup path). We haven't dug into
that yet, and we can conduct an investigation in the following days.

thanks,
Chenyu
Re: [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated
Posted by Madadi Vineeth Reddy 7 months, 3 weeks ago
On 24/04/25 19:41, Chen, Yu C wrote:
> Hi Madadi,
> 
> On 4/24/2025 5:22 PM, Madadi Vineeth Reddy wrote:
>> Hi Chen Yu,
>>
>> On 21/04/25 08:55, Chen Yu wrote:
>>> [..snip..]
>> 
>>> +static bool valid_target_cpu(int cpu, struct task_struct *p)
>>> +{
>>> +    int nr_running, llc_weight;
>>> +    unsigned long util, llc_cap;
>>> +
>>> +    if (!get_llc_stats(cpu, &nr_running, &llc_weight,
>>> +               &util))
>>> +        return false;
>>> +
>>> +    llc_cap = llc_weight * SCHED_CAPACITY_SCALE;
>>> +
>>> +    /*
>>> +     * If this process has many threads, be careful to avoid
>>> +     * task stacking on the preferred LLC, by checking the LLC's
>>> +     * utilization and number of runnable tasks. Otherwise, if
>>> +     * this process does not have many threads, honor the cache
>>> +     * aware wakeup.
>>> +     */
>>> +    if (get_nr_threads(p) < llc_weight)
>>> +        return true;
>>
>> IIUC, there might be scenarios where the LLC is already overloaded with
>> threads of another process. In that case, we will return true for p in the
>> above condition and not check the below conditions. Shouldn't we check
>> the below two conditions either way?
> 
> The reason why get_nr_threads() was used is that we don't know if the
> following thresholds are suitable for different workloads. We chose 25%
> and 33% because we found that they worked well for workload A, but were
> too low for workload B. Workload B requires cache-aware scheduling to be
> enabled in any case, and the number of threads in B is smaller than the
> llc_weight. Therefore, we use the above check to meet the requirements
> of B. What you said is correct. We can remove the above check on
> nr_threads and make the combination of utilization and nr_running a
> mandatory check, and then conduct further tuning.

Thanks, Chen. It's always tricky to make all workloads happy. As long as
we're not regressing too much on the others, it should be fine, I guess,
given the overall impact is positive.

JFYI, in Power10 the LLC is at the small-core level, containing 4 CPU
threads. So nr_running on the LLC can't be more than 1 for cache aware
scheduling to work.
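
(With llc_weight = 4, the nr_running * 3 >= llc_weight check already trips
at nr_running = 2, since 2 * 3 = 6 >= 4, so the gate passes only when the
LLC has at most one runnable task on average.)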

>> Tested this patch with the real-life workload Daytrader and didn't see any
>> regression.
> 
> Good to know the regression is gone.
> 
>> It spawns a lot of threads and is CPU intensive, so I think it's not
>> impacted by the below conditions.
>>
>> Also, in the schbench numbers you provided, there is a degradation in the
>> saturated case. Is it due to the overhead of computing the preferred LLC,
>> which then goes unused because of the below conditions?
> 
> Yes, the overhead of the preferred LLC calculation could be one part, and
> we also suspect that the degradation might be tied to the task migrations.
> We still observed more task migrations than the baseline, even when the
> system was saturated (in theory, once the 25% threshold is exceeded, we
> should fall back to the generic task wakeup path). We haven't dug into
> that yet, and we can conduct an investigation in the following days.

Interesting. I will also try to look into these extra migrations.

Thanks,
Madadi Vineeth Reddy

Re: [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated
Posted by Tim Chen 7 months, 3 weeks ago
On Thu, 2025-04-24 at 22:11 +0800, Chen, Yu C wrote:
> 
> > It spawns a lot of threads and is CPU intensive, so I think it's not
> > impacted by the below conditions.
> > 
> > Also, in the schbench numbers you provided, there is a degradation in the
> > saturated case. Is it due to the overhead of computing the preferred LLC,
> > which then goes unused because of the below conditions?
> 
> Yes, the overhead of the preferred LLC calculation could be one part, and
> we also suspect that the degradation might be tied to the task migrations.
> We still observed more task migrations than the baseline, even when the
> system was saturated (in theory, once the 25% threshold is exceeded, we
> should fall back to the generic task wakeup path). We haven't dug into
> that yet, and we can conduct an investigation in the following days.

In the saturation case it is mostly the tail latency that regresses.
The preferred LLC has a tendency to have a higher load than the
other LLCs. The load balancer will try to move tasks out, and wake balance
will try to move them back to the preferred LLC. This increases task
migrations and affects tail latency.

Tim
Re: [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated
Posted by Madadi Vineeth Reddy 7 months, 3 weeks ago
Hi Tim,

On 24/04/25 21:21, Tim Chen wrote:
> On Thu, 2025-04-24 at 22:11 +0800, Chen, Yu C wrote:
>> [..snip..]
> 
> In the saturation case it is mostly the tail latency that regresses.
> The preferred LLC has a tendency to have a higher load than the
> other LLCs. The load balancer will try to move tasks out, and wake balance
> will try to move them back to the preferred LLC. This increases task
> migrations and affects tail latency.

Why would the task be moved back to the preferred LLC in the wakeup path in
the saturated case? The checks shouldn't allow it, right?

Thanks,
Madadi Vineeth Reddy

Re: [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the preferred LLC is over aggregated
Posted by Tim Chen 7 months, 3 weeks ago
On Fri, 2025-04-25 at 14:43 +0530, Madadi Vineeth Reddy wrote:
> Hi Tim,
> 
> On 24/04/25 21:21, Tim Chen wrote:
> > > [..snip..]
> > 
> > In the saturation case it is mostly the tail latency that regresses.
> > The preferred LLC has a tendency to have a higher load than the
> > other LLCs. The load balancer will try to move tasks out, and wake balance
> > will try to move them back to the preferred LLC. This increases task
> > migrations and affects tail latency.
> 
> Why would the task be moved back to the preferred LLC in the wakeup path in
> the saturated case? The checks shouldn't allow it, right?

Task wakeups happen very frequently in schbench, and it takes a while for
the utilization to catch up: the utilization of the LLC is only updated at
the LLC's load balance time.

So once the utilization falls below the threshold, there is a window in
which woken tasks will rush into the preferred LLC, until the utilization
is updated at the next load balance.
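
(In this patch, sd_share->util_avg is only written from update_sg_if_llc()
during periodic load balance, so between two balance passes every wakeup
sees the same, possibly stale, value and can pass the 25% check.)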

Tim

