[RFC PATCH 0/5] sched: Introduce Cache aware scheduling
Posted by Chen Yu 7 months, 3 weeks ago
This is a respin of the cache-aware scheduling proposed by Peter[1].
In this patch set, some known issues in [1] were addressed, and the performance
regression was investigated and mitigated.

Cache-aware scheduling aims to aggregate tasks with potential shared resources
into the same cache domain. This approach enhances cache locality, thereby optimizing
system performance by reducing cache misses and improving data access efficiency.

In the current implementation, threads within the same process are treated as
entities that potentially share resources. Cache-aware scheduling monitors the CPU
occupancy of each cache domain for every process. Based on this monitoring, it tries
to migrate the threads of a given process to its cache-hot domains, with the goal of
maximizing cache locality.
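
As a rough illustration of the idea (this is only a sketch, not the actual
patch code; struct mm_sched, the pcpu_sched array and mm_preferred_llc() below
are made up for the example), the per-process, per-LLC occupancy and the
resulting preferred LLC can be thought of along these lines:

	/*
	 * Sketch only: each mm keeps decayed per-LLC runtime statistics, and
	 * the LLC with the highest recent occupancy becomes the process's
	 * preferred (cache-hot) LLC, which wakeup and load balancing then
	 * try to migrate its threads towards.
	 */
	struct mm_sched {
		u64	runtime;	/* decayed CPU time spent in this LLC */
	};

	static int mm_preferred_llc(struct mm_struct *mm, int nr_llcs)
	{
		int llc, best_llc = -1;
		u64 best = 0;

		for (llc = 0; llc < nr_llcs; llc++) {
			u64 occ = mm->pcpu_sched[llc].runtime;

			if (occ > best) {
				best = occ;
				best_llc = llc;
			}
		}

		return best_llc;	/* -1: no preference established yet */
	}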

Patch 1 implements the fundamental cache-aware scheduling. It is the same patch as in [1].
Patch 2 comprises a series of fixes for Patch 1, including compile warnings and functional
fixes.
Patch 3 fixes the performance degradation that arises from excessive task migrations within
the preferred LLC domain.
Patch 4 further alleviates performance regressions when the preferred LLC becomes saturated.
Patch 5 introduces ftrace events, which are used to track task migrations triggered by wakeup
and by the load balancer. This addition facilitates performance regression analysis.

The patch set is applied on top of v6.14 sched/core,
commit 4ba7518327c6 ("sched/debug: Print the local group's asym_prefer_cpu")

schbench was tested on EMR and Zen3 Milan. An improvement in tail latency was observed when 
the LLC was underloaded; however, some regressions were still evident when the LLC was
saturated. Additionally, the load balancing logic should be adjusted to further address these
regressions.

[1] https://lore.kernel.org/all/20250325120952.GJ36322@noisy.programming.kicks-ass.net/


Chen Yu (4):
  sched: Several fixes for cache aware scheduling
  sched: Avoid task migration within its preferred LLC
  sched: Inhibit cache aware scheduling if the preferred LLC is over
    aggregated
  sched: Add ftrace to track task migration and load balance within and
    across LLC

Peter Zijlstra (1):
  sched: Cache aware load-balancing

 include/linux/mm_types.h       |  44 ++++
 include/linux/sched.h          |   4 +
 include/linux/sched/topology.h |   4 +
 include/trace/events/sched.h   |  51 ++++
 init/Kconfig                   |   4 +
 kernel/fork.c                  |   5 +
 kernel/sched/core.c            |  13 +-
 kernel/sched/fair.c            | 461 +++++++++++++++++++++++++++++++--
 kernel/sched/features.h        |   1 +
 kernel/sched/sched.h           |   8 +
 10 files changed, 569 insertions(+), 26 deletions(-)

-- 
2.25.1
Re: [RFC PATCH 0/5] sched: Introduce Cache aware scheduling
Posted by K Prateek Nayak 7 months, 2 weeks ago
Hello Chenyu,

On 4/21/2025 8:53 AM, Chen Yu wrote:
> This is a respin of the cache-aware scheduling proposed by Peter[1].
> In this patch set, some known issues in [1] were addressed, and the performance
> regression was investigated and mitigated.
> 
> Cache-aware scheduling aims to aggregate tasks with potential shared resources
> into the same cache domain. This approach enhances cache locality, thereby optimizing
> system performance by reducing cache misses and improving data access efficiency.
> 
> In the current implementation, threads within the same process are treated as
> entities that potentially share resources. Cache-aware scheduling monitors the CPU
> occupancy of each cache domain for every process. Based on this monitoring, it tries
> to migrate the threads of a given process to its cache-hot domains, with the goal of
> maximizing cache locality.
> 
> Patch 1 implements the fundamental cache-aware scheduling. It is the same patch as in [1].
> Patch 2 comprises a series of fixes for Patch 1, including compile warnings and functional
> fixes.
> Patch 3 fixes the performance degradation that arises from excessive task migrations within
> the preferred LLC domain.
> Patch 4 further alleviates performance regressions when the preferred LLC becomes saturated.
> Patch 5 introduces ftrace events, which are used to track task migrations triggered by wakeup
> and by the load balancer. This addition facilitates performance regression analysis.
> 
> The patch set is applied on top of v6.14 sched/core,
> commit 4ba7518327c6 ("sched/debug: Print the local group's asym_prefer_cpu")
> 

Thank you for working on this! I have been a bit preoccupied but I
promise to look into the regressions I've reported below sometime
this week and report back soon on what seems to make them unhappy.

tl;dr

o Most regressions aren't as severe as v1 thanks to all the work
   from you and Abel.

o I too see schbench regress in fully loaded cases but the old
   schbench tail latencies improve when #threads < #CPUs in LLC

o There is a consistent regression in tbench - what I presume is
   happening there is that all threads of "tbench_srv" share an mm
   and all the tbench clients share an mm, but for best performance,
   the wakeups between client and server must be local (same core /
   same LLC); either the cost of the additional search builds up, or
   the clients get co-located as one set of entities and the
   servers get co-located as another set of entities, leading to
   mostly remote wakeups.

   Not too sure if netperf has similar architecture as tbench but
   that too sees a regression.

o Longer running benchmarks see a regression. Still not sure if
   this is because of additional search or something else.

I'll leave the full results below:

o Machine details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)

o Benchmark results

   ==================================================================
   Test          : hackbench
   Units         : Normalized time in seconds
   Interpretation: Lower is better
   Statistic     : AMean
   ==================================================================
   Case:           tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
    1-groups     1.00 [ -0.00]( 9.02)     1.03 [ -3.38](11.44)
    2-groups     1.00 [ -0.00]( 6.86)     0.98 [  2.20]( 6.61)
    4-groups     1.00 [ -0.00]( 2.73)     1.00 [  0.42]( 4.00)
    8-groups     1.00 [ -0.00]( 1.21)     1.04 [ -4.00]( 5.59)
   16-groups     1.00 [ -0.00]( 0.97)     1.01 [ -0.52]( 2.12)


   ==================================================================
   Test          : tbench
   Units         : Normalized throughput
   Interpretation: Higher is better
   Statistic     : AMean
   ==================================================================
   Clients:    tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
       1     1.00 [  0.00]( 0.67)     0.96 [ -3.95]( 0.55)
       2     1.00 [  0.00]( 0.85)     0.98 [ -1.69]( 0.65)
       4     1.00 [  0.00]( 0.52)     0.96 [ -3.68]( 0.09)
       8     1.00 [  0.00]( 0.92)     0.96 [ -4.06]( 0.43)
      16     1.00 [  0.00]( 1.01)     0.95 [ -5.19]( 1.65)
      32     1.00 [  0.00]( 1.35)     0.95 [ -4.79]( 0.29)
      64     1.00 [  0.00]( 1.22)     0.94 [ -6.49]( 1.46)
     128     1.00 [  0.00]( 2.39)     0.92 [ -7.61]( 1.41)
     256     1.00 [  0.00]( 1.83)     0.92 [ -8.24]( 0.35)
     512     1.00 [  0.00]( 0.17)     0.93 [ -7.08]( 0.22)
    1024     1.00 [  0.00]( 0.31)     0.91 [ -8.57]( 0.29)


   ==================================================================
   Test          : stream-10
   Units         : Normalized Bandwidth, MB/s
   Interpretation: Higher is better
   Statistic     : HMean
   ==================================================================
   Test:       tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
    Copy     1.00 [  0.00]( 8.24)     1.03 [  2.66]( 6.15)
   Scale     1.00 [  0.00]( 5.62)     0.99 [ -1.43]( 6.32)
     Add     1.00 [  0.00]( 6.18)     0.97 [ -3.12]( 5.70)
   Triad     1.00 [  0.00]( 5.29)     1.01 [  1.31]( 3.82)


   ==================================================================
   Test          : stream-100
   Units         : Normalized Bandwidth, MB/s
   Interpretation: Higher is better
   Statistic     : HMean
   ==================================================================
   Test:       tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
    Copy     1.00 [  0.00]( 2.92)     0.99 [ -1.47]( 5.02)
   Scale     1.00 [  0.00]( 4.80)     0.98 [ -2.08]( 5.53)
     Add     1.00 [  0.00]( 4.35)     0.98 [ -1.85]( 4.26)
   Triad     1.00 [  0.00]( 2.30)     0.99 [ -0.84]( 1.83)


   ==================================================================
   Test          : netperf
   Units         : Normalized Throughput
   Interpretation: Higher is better
   Statistic     : AMean
   ==================================================================
   Clients:         tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
    1-clients     1.00 [  0.00]( 0.17)     0.97 [ -2.55]( 0.50)
    2-clients     1.00 [  0.00]( 0.77)     0.97 [ -2.52]( 0.20)
    4-clients     1.00 [  0.00]( 0.93)     0.97 [ -3.30]( 0.54)
    8-clients     1.00 [  0.00]( 0.87)     0.96 [ -3.98]( 1.19)
   16-clients     1.00 [  0.00]( 1.15)     0.96 [ -4.16]( 1.06)
   32-clients     1.00 [  0.00]( 1.00)     0.95 [ -5.47]( 0.96)
   64-clients     1.00 [  0.00]( 1.37)     0.94 [ -5.75]( 1.64)
   128-clients    1.00 [  0.00]( 0.99)     0.92 [ -8.50]( 1.49)
   256-clients    1.00 [  0.00]( 3.23)     0.90 [-10.22]( 2.86)
   512-clients    1.00 [  0.00](58.43)     0.90 [-10.28](47.59)


   ==================================================================
   Test          : schbench
   Units         : Normalized 99th percentile latency in us
   Interpretation: Lower is better
   Statistic     : Median
   ==================================================================
   #workers: tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
     1     1.00 [ -0.00]( 5.59)     0.55 [ 45.00](11.17)
     2     1.00 [ -0.00](14.29)     0.52 [ 47.62]( 7.53)
     4     1.00 [ -0.00]( 1.24)     0.57 [ 42.55]( 5.73)
     8     1.00 [ -0.00](11.16)     1.06 [ -6.12]( 2.92)
    16     1.00 [ -0.00]( 6.81)     1.12 [-12.28](11.09)
    32     1.00 [ -0.00]( 6.99)     1.05 [ -5.26](12.48)
    64     1.00 [ -0.00]( 6.00)     0.96 [  4.21](18.31)
   128     1.00 [ -0.00]( 3.26)     1.63 [-62.84](36.71)
   256     1.00 [ -0.00](19.29)     0.97 [  3.25]( 4.94)
   512     1.00 [ -0.00]( 1.48)     1.05 [ -4.71]( 5.11)


   ==================================================================
   Test          : new-schbench-requests-per-second
   Units         : Normalized Requests per second
   Interpretation: Higher is better
   Statistic     : Median
   ==================================================================
   #workers: tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
     1     1.00 [  0.00]( 0.00)     0.95 [ -4.99]( 0.48)
     2     1.00 [  0.00]( 0.26)     0.96 [ -3.82]( 0.55)
     4     1.00 [  0.00]( 0.15)     0.95 [ -4.96]( 0.27)
     8     1.00 [  0.00]( 0.15)     0.99 [ -0.58]( 0.00)
    16     1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.15)
    32     1.00 [  0.00]( 4.88)     1.04 [  4.27]( 2.42)
    64     1.00 [  0.00]( 5.57)     0.87 [-13.10](11.51)
   128     1.00 [  0.00]( 0.34)     0.97 [ -3.13]( 0.58)
   256     1.00 [  0.00]( 1.95)     1.02 [  1.83]( 0.15)
   512     1.00 [  0.00]( 0.44)     1.00 [  0.48]( 0.12)


   ==================================================================
   Test          : new-schbench-wakeup-latency
   Units         : Normalized 99th percentile latency in us
   Interpretation: Lower is better
   Statistic     : Median
   ==================================================================
   #workers: tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
     1     1.00 [ -0.00]( 4.19)     1.00 [ -0.00](14.91)
     2     1.00 [ -0.00]( 3.78)     0.93 [  7.14]( 0.00)
     4     1.00 [ -0.00]( 8.91)     0.80 [ 20.00]( 4.43)
     8     1.00 [ -0.00]( 7.45)     1.00 [ -0.00]( 7.45)
    16     1.00 [ -0.00]( 4.08)     1.00 [ -0.00](10.79)
    32     1.00 [ -0.00](16.90)     0.93 [  6.67](10.00)
    64     1.00 [ -0.00]( 9.11)     1.12 [-12.50]( 0.00)
   128     1.00 [ -0.00]( 7.05)     2.43 [-142.86](24.47)
   256     1.00 [ -0.00]( 4.32)     1.02 [ -2.34]( 1.20)
   512     1.00 [ -0.00]( 0.35)     1.01 [ -0.77]( 0.40)


   ==================================================================
   Test          : new-schbench-request-latency
   Units         : Normalized 99th percentile latency in us
   Interpretation: Lower is better
   Statistic     : Median
   ==================================================================
   #workers: tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
     1     1.00 [ -0.00]( 0.78)     1.16 [-15.70]( 2.14)
     2     1.00 [ -0.00]( 0.81)     1.13 [-13.11]( 0.62)
     4     1.00 [ -0.00]( 0.24)     1.26 [-26.11](16.43)
     8     1.00 [ -0.00]( 1.30)     1.03 [ -3.46]( 0.81)
    16     1.00 [ -0.00]( 1.11)     1.02 [ -2.12]( 1.85)
    32     1.00 [ -0.00]( 5.94)     0.96 [  4.05]( 4.48)
    64     1.00 [ -0.00]( 6.27)     1.06 [ -6.01]( 6.67)
   128     1.00 [ -0.00]( 0.21)     1.12 [-12.31]( 2.61)
   256     1.00 [ -0.00](13.73)     1.06 [ -6.30]( 3.37)
   512     1.00 [ -0.00]( 0.95)     1.05 [ -4.85]( 0.61)


   ==================================================================
   Test          : Various longer running benchmarks
   Units         : %diff in throughput reported
   Interpretation: Higher is better
   Statistic     : Median
   ==================================================================
   Benchmarks:                 %diff
   ycsb-cassandra              -1.21%
   ycsb-mongodb                -0.69%

   deathstarbench-1x           -7.40%
   deathstarbench-2x           -3.80%
   deathstarbench-3x           -3.99%
   deathstarbench-6x           -3.02%

   hammerdb+mysql 16VU         -2.59%
   hammerdb+mysql 64VU         -1.05%


Also, could you fold the below diff into your Patch2:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eb5a2572b4f8..6c51dd2b7b32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7694,8 +7694,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
  	int i, cpu, idle_cpu = -1, nr = INT_MAX;
  	struct sched_domain_shared *sd_share;
  
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
-
  	if (sched_feat(SIS_UTIL)) {
  		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
  		if (sd_share) {
@@ -7707,6 +7705,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
  		}
  	}
  
+	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
  	if (static_branch_unlikely(&sched_cluster_active)) {
  		struct sched_group *sg = sd->groups;
  
---

If the SIS_UTIL cutoff hits, the result of the cpumask_and() is of no
use. To save some additional cycles, especially in cases where we target
the LLC frequently and the search bails out because the LLC is busy,
this overhead can be easily avoided. Since select_idle_cpu() can now be
called twice per wakeup, this overhead can be visible in benchmarks like
hackbench.
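
In other words, after the change the prologue takes the SIS_UTIL early
exit first and only then builds the candidate mask - roughly (simplified
sketch of the resulting order, not the exact code):

	if (sched_feat(SIS_UTIL)) {
		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
		if (sd_share) {
			/* Overloaded LLC: bail out before doing any mask work */
			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
			if (nr == 1)
				return -1;
		}
	}

	/* Build the candidate mask only when a scan will actually happen */
	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);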

-- 
Thanks and Regards,
Prateek

> schbench was tested on EMR and Zen3 Milan. An improvement in tail latency was observed when
> the LLC was underloaded; however, some regressions were still evident when the LLC was
> saturated. Additionally, the load balancing logic should be adjusted to further address these
> regressions.
> 
> [1] https://lore.kernel.org/all/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
> 
> 
> Chen Yu (4):
>    sched: Several fixes for cache aware scheduling
>    sched: Avoid task migration within its preferred LLC
>    sched: Inhibit cache aware scheduling if the preferred LLC is over
>      aggregated
>    sched: Add ftrace to track task migration and load balance within and
>      across LLC
> 
> Peter Zijlstra (1):
>    sched: Cache aware load-balancing
>
Re: [RFC PATCH 0/5] sched: Introduce Cache aware scheduling
Posted by Chen, Yu C 7 months, 2 weeks ago
Hi Prateek,

On 4/29/2025 11:47 AM, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 4/21/2025 8:53 AM, Chen Yu wrote:
>> This is a respin of the cache-aware scheduling proposed by Peter[1].
>> In this patch set, some known issues in [1] were addressed, and the 
>> performance
>> regression was investigated and mitigated.
>>
>> Cache-aware scheduling aims to aggregate tasks with potential shared 
>> resources
>> into the same cache domain. This approach enhances cache locality, 
>> thereby optimizing
>> system performance by reducing cache misses and improving data access 
>> efficiency.
>>
>> In the current implementation, threads within the same process are
>> treated as entities that potentially share resources. Cache-aware
>> scheduling monitors the CPU occupancy of each cache domain for every
>> process. Based on this monitoring, it tries to migrate the threads of
>> a given process to its cache-hot domains, with the goal of maximizing
>> cache locality.
>>
>> Patch 1 implements the fundamental cache-aware scheduling. It is the
>> same patch as in [1].
>> Patch 2 comprises a series of fixes for Patch 1, including compile
>> warnings and functional fixes.
>> Patch 3 fixes the performance degradation that arises from excessive
>> task migrations within the preferred LLC domain.
>> Patch 4 further alleviates performance regressions when the preferred
>> LLC becomes saturated.
>> Patch 5 introduces ftrace events, which are used to track task
>> migrations triggered by wakeup and by the load balancer. This addition
>> facilitates performance regression analysis.
>>
>> The patch set is applied on top of v6.14 sched/core,
>> commit 4ba7518327c6 ("sched/debug: Print the local group's 
>> asym_prefer_cpu")
>>
> 
> Thank you for working on this! I have been a bit preoccupied but I
> promise to look into the regressions I've reported below sometime
> this week and report back soon on what seems to make them unhappy.
> 

Thanks for your time on this testing.

> tl;dr
> 
> o Most regressions aren't as severe as v1 thanks to all the work
>    from you and Abel.
> 
> o I too see schbench regress in fully loaded cases but the old
>    schbench tail latencies improve when #threads < #CPUs in LLC
> 
> o There is a consistent regression in tbench - what I presume is
>    happening there is that all threads of "tbench_srv" share an mm
>    and all the tbench clients share an mm, but for best performance,
>    the wakeups between client and server must be local (same core /
>    same LLC); either the cost of the additional search builds up, or
>    the clients get co-located as one set of entities and the
>    servers get co-located as another set of entities, leading to
>    mostly remote wakeups.

This is a good point. If A and B are both multi-threaded processes,
and A interacts with B frequently, we should consider not only
aggregating the threads within A and B, but also placing A and
B together. I'm not sure if WF_SYNC is carried along and takes
effect during the tbench socket wakeup process. I'll also run
tbench/netperf tests.
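
Just to sketch what I have in mind (a rough sketch only, not actual patch
code; pick_preferred_llc() and preferred_llc_of() are hypothetical helpers),
the sync hint could be used to bias the preferred-LLC choice towards the
waker's LLC:

	/*
	 * Sketch only: on a WF_SYNC wakeup the waker is about to sleep, so
	 * treating the waker's LLC as the wakee's preferred LLC would keep
	 * client/server pairs (like tbench) on the same cache domain.
	 */
	static int pick_preferred_llc(struct task_struct *p, int waker_cpu,
				      int wake_flags)
	{
		int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);

		if (sync)
			return per_cpu(sd_llc_id, waker_cpu);

		/* otherwise fall back to the per-process occupancy-based choice */
		return preferred_llc_of(p->mm);
	}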

> 
>    Not too sure if netperf has similar architecture as tbench but
>    that too sees a regression.
> 
> o Longer running benchmarks see a regression. Still not sure if
>    this is because of additional search or something else.
> 
> I'll leave the full results below:
> 
> o Machine details
> 
> - 3rd Generation EPYC System
> - 2 sockets each with 64C/128T
> - NPS1 (Each socket is a NUMA node)

> - C2 Disabled (POLL and C1(MWAIT) remained enabled)
> 
> o Benchmark results
> 
>    ==================================================================
>    Test          : hackbench
>    Units         : Normalized time in seconds
>    Interpretation: Lower is better
>    Statistic     : AMean
>    ==================================================================
>    Case:           tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
>     1-groups     1.00 [ -0.00]( 9.02)     1.03 [ -3.38](11.44)
>     2-groups     1.00 [ -0.00]( 6.86)     0.98 [  2.20]( 6.61)
>     4-groups     1.00 [ -0.00]( 2.73)     1.00 [  0.42]( 4.00)
>     8-groups     1.00 [ -0.00]( 1.21)     1.04 [ -4.00]( 5.59)
>    16-groups     1.00 [ -0.00]( 0.97)     1.01 [ -0.52]( 2.12)
> 
> 
>    ==================================================================
>    Test          : tbench
>    Units         : Normalized throughput
>    Interpretation: Higher is better
>    Statistic     : AMean
>    ==================================================================
>    Clients:    tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
>        1     1.00 [  0.00]( 0.67)     0.96 [ -3.95]( 0.55)
>        2     1.00 [  0.00]( 0.85)     0.98 [ -1.69]( 0.65)
>        4     1.00 [  0.00]( 0.52)     0.96 [ -3.68]( 0.09)
>        8     1.00 [  0.00]( 0.92)     0.96 [ -4.06]( 0.43)
>       16     1.00 [  0.00]( 1.01)     0.95 [ -5.19]( 1.65)
>       32     1.00 [  0.00]( 1.35)     0.95 [ -4.79]( 0.29)
>       64     1.00 [  0.00]( 1.22)     0.94 [ -6.49]( 1.46)
>      128     1.00 [  0.00]( 2.39)     0.92 [ -7.61]( 1.41)
>      256     1.00 [  0.00]( 1.83)     0.92 [ -8.24]( 0.35)
>      512     1.00 [  0.00]( 0.17)     0.93 [ -7.08]( 0.22)
>     1024     1.00 [  0.00]( 0.31)     0.91 [ -8.57]( 0.29)
> 
> 
>    ==================================================================
>    Test          : stream-10
>    Units         : Normalized Bandwidth, MB/s
>    Interpretation: Higher is better
>    Statistic     : HMean
>    ==================================================================
>    Test:       tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
>     Copy     1.00 [  0.00]( 8.24)     1.03 [  2.66]( 6.15)
>    Scale     1.00 [  0.00]( 5.62)     0.99 [ -1.43]( 6.32)
>      Add     1.00 [  0.00]( 6.18)     0.97 [ -3.12]( 5.70)
>    Triad     1.00 [  0.00]( 5.29)     1.01 [  1.31]( 3.82)
> 
> 
>    ==================================================================
>    Test          : stream-100
>    Units         : Normalized Bandwidth, MB/s
>    Interpretation: Higher is better
>    Statistic     : HMean
>    ==================================================================
>    Test:       tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
>     Copy     1.00 [  0.00]( 2.92)     0.99 [ -1.47]( 5.02)
>    Scale     1.00 [  0.00]( 4.80)     0.98 [ -2.08]( 5.53)
>      Add     1.00 [  0.00]( 4.35)     0.98 [ -1.85]( 4.26)
>    Triad     1.00 [  0.00]( 2.30)     0.99 [ -0.84]( 1.83)
> 
> 
>    ==================================================================
>    Test          : netperf
>    Units         : Normalized Throughput
>    Interpretation: Higher is better
>    Statistic     : AMean
>    ==================================================================
>    Clients:         tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
>     1-clients     1.00 [  0.00]( 0.17)     0.97 [ -2.55]( 0.50)
>     2-clients     1.00 [  0.00]( 0.77)     0.97 [ -2.52]( 0.20)
>     4-clients     1.00 [  0.00]( 0.93)     0.97 [ -3.30]( 0.54)
>     8-clients     1.00 [  0.00]( 0.87)     0.96 [ -3.98]( 1.19)
>    16-clients     1.00 [  0.00]( 1.15)     0.96 [ -4.16]( 1.06)
>    32-clients     1.00 [  0.00]( 1.00)     0.95 [ -5.47]( 0.96)
>    64-clients     1.00 [  0.00]( 1.37)     0.94 [ -5.75]( 1.64)
>    128-clients    1.00 [  0.00]( 0.99)     0.92 [ -8.50]( 1.49)
>    256-clients    1.00 [  0.00]( 3.23)     0.90 [-10.22]( 2.86)
>    512-clients    1.00 [  0.00](58.43)     0.90 [-10.28](47.59)
> 
> 
>    ==================================================================
>    Test          : schbench
>    Units         : Normalized 99th percentile latency in us
>    Interpretation: Lower is better
>    Statistic     : Median
>    ==================================================================
>    #workers: tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
>      1     1.00 [ -0.00]( 5.59)     0.55 [ 45.00](11.17)
>      2     1.00 [ -0.00](14.29)     0.52 [ 47.62]( 7.53)
>      4     1.00 [ -0.00]( 1.24)     0.57 [ 42.55]( 5.73)
>      8     1.00 [ -0.00](11.16)     1.06 [ -6.12]( 2.92)
>     16     1.00 [ -0.00]( 6.81)     1.12 [-12.28](11.09)
>     32     1.00 [ -0.00]( 6.99)     1.05 [ -5.26](12.48)
>     64     1.00 [ -0.00]( 6.00)     0.96 [  4.21](18.31)
>    128     1.00 [ -0.00]( 3.26)     1.63 [-62.84](36.71)
>    256     1.00 [ -0.00](19.29)     0.97 [  3.25]( 4.94)
>    512     1.00 [ -0.00]( 1.48)     1.05 [ -4.71]( 5.11)
> 
> 
>    ==================================================================
>    Test          : new-schbench-requests-per-second
>    Units         : Normalized Requests per second
>    Interpretation: Higher is better
>    Statistic     : Median
>    ==================================================================
>    #workers: tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
>      1     1.00 [  0.00]( 0.00)     0.95 [ -4.99]( 0.48)
>      2     1.00 [  0.00]( 0.26)     0.96 [ -3.82]( 0.55)
>      4     1.00 [  0.00]( 0.15)     0.95 [ -4.96]( 0.27)
>      8     1.00 [  0.00]( 0.15)     0.99 [ -0.58]( 0.00)
>     16     1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.15)
>     32     1.00 [  0.00]( 4.88)     1.04 [  4.27]( 2.42)
>     64     1.00 [  0.00]( 5.57)     0.87 [-13.10](11.51)
>    128     1.00 [  0.00]( 0.34)     0.97 [ -3.13]( 0.58)
>    256     1.00 [  0.00]( 1.95)     1.02 [  1.83]( 0.15)
>    512     1.00 [  0.00]( 0.44)     1.00 [  0.48]( 0.12)
> 
> 
>    ==================================================================
>    Test          : new-schbench-wakeup-latency
>    Units         : Normalized 99th percentile latency in us
>    Interpretation: Lower is better
>    Statistic     : Median
>    ==================================================================
>    #workers: tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
>      1     1.00 [ -0.00]( 4.19)     1.00 [ -0.00](14.91)
>      2     1.00 [ -0.00]( 3.78)     0.93 [  7.14]( 0.00)
>      4     1.00 [ -0.00]( 8.91)     0.80 [ 20.00]( 4.43)
>      8     1.00 [ -0.00]( 7.45)     1.00 [ -0.00]( 7.45)
>     16     1.00 [ -0.00]( 4.08)     1.00 [ -0.00](10.79)
>     32     1.00 [ -0.00](16.90)     0.93 [  6.67](10.00)
>     64     1.00 [ -0.00]( 9.11)     1.12 [-12.50]( 0.00)
>    128     1.00 [ -0.00]( 7.05)     2.43 [-142.86](24.47)

OK, this was what I saw too. I'm looking into this.

>    256     1.00 [ -0.00]( 4.32)     1.02 [ -2.34]( 1.20)
>    512     1.00 [ -0.00]( 0.35)     1.01 [ -0.77]( 0.40)
> 
> 
>    ==================================================================
>    Test          : new-schbench-request-latency
>    Units         : Normalized 99th percentile latency in us
>    Interpretation: Lower is better
>    Statistic     : Median
>    ==================================================================
>    #workers: tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
>      1     1.00 [ -0.00]( 0.78)     1.16 [-15.70]( 2.14)
>      2     1.00 [ -0.00]( 0.81)     1.13 [-13.11]( 0.62)
>      4     1.00 [ -0.00]( 0.24)     1.26 [-26.11](16.43)
>      8     1.00 [ -0.00]( 1.30)     1.03 [ -3.46]( 0.81)
>     16     1.00 [ -0.00]( 1.11)     1.02 [ -2.12]( 1.85)
>     32     1.00 [ -0.00]( 5.94)     0.96 [  4.05]( 4.48)
>     64     1.00 [ -0.00]( 6.27)     1.06 [ -6.01]( 6.67)
>    128     1.00 [ -0.00]( 0.21)     1.12 [-12.31]( 2.61)
>    256     1.00 [ -0.00](13.73)     1.06 [ -6.30]( 3.37)
>    512     1.00 [ -0.00]( 0.95)     1.05 [ -4.85]( 0.61)
> 
> 
>    ==================================================================
>    Test          : Various longer running benchmarks
>    Units         : %diff in throughput reported
>    Interpretation: Higher is better
>    Statistic     : Median
>    ==================================================================
>    Benchmarks:                 %diff
>    ycsb-cassandra              -1.21%
>    ycsb-mongodb                -0.69%
> 
>    deathstarbench-1x           -7.40%
>    deathstarbench-2x           -3.80%
>    deathstarbench-3x           -3.99%
>    deathstarbench-6x           -3.02%
> 
>    hammerdb+mysql 16VU         -2.59%
>    hammerdb+mysql 64VU         -1.05%
> 

For long-running tasks, the penalty of remote cache access is severe.
This might indicate a similar issue to the tbench/netperf one you
mentioned: different processes are aggregated into different LLCs, but
these processes interact with each other and WF_SYNC did not take effect.

> 
> Also, could you fold the below diff into your Patch2:
> 

Sure, let me apply it and do the test.

thanks,
Chenyu

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eb5a2572b4f8..6c51dd2b7b32 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7694,8 +7694,6 @@ static int select_idle_cpu(struct task_struct *p, 
> struct sched_domain *sd, bool
>       int i, cpu, idle_cpu = -1, nr = INT_MAX;
>       struct sched_domain_shared *sd_share;
> 
> -    cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> -
>       if (sched_feat(SIS_UTIL)) {
>           sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
>           if (sd_share) {
> @@ -7707,6 +7705,8 @@ static int select_idle_cpu(struct task_struct *p, 
> struct sched_domain *sd, bool
>           }
>       }
> 
> +    cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +
>       if (static_branch_unlikely(&sched_cluster_active)) {
>           struct sched_group *sg = sd->groups;
> 
> ---
> 
> If the SIS_UTIL cutoff hits, the result of the cpumask_and() is of no
> use. To save some additional cycles, especially in cases where we target
> the LLC frequently and the search bails out because the LLC is busy,
> this overhead can be easily avoided. Since select_idle_cpu() can now be
> called twice per wakeup, this overhead can be visible in benchmarks like
> hackbench.
>