[v1] sched/balancing: Misc updates & cleanups

Expand all Fold all

[PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag

Posted by Ingo Molnar 1 year, 11 months ago

The 'balancing' spinlock added in:

  08c183f31bdb ("[PATCH] sched: add option to serialize load balancing")

... is taken when the SD_SERIALIZE flag is set in a domain, but in reality it
is a glorified global atomic flag serializing the load-balancing of
those domains.

It doesn't have any explicit locking semantics per se: we just
spin_trylock() it.

Turn it into a ... global atomic flag. This makes it more
clear what is going on here, and reduces overhead and code
size a bit:

  # kernel/sched/fair.o: [x86-64 defconfig]

     text	   data	    bss	    dec	    hex	filename
    60730	   2721	    104	  63555	   f843	fair.o.before
    60718	   2721	    104	  63543	   f837	fair.o.after

Also document the flag a bit.

No change in functionality intended.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Valentin Schneider <vschneid@redhat.com>
---
 kernel/sched/fair.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a16129f9a5c..64ae3d8dc93b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11633,7 +11633,20 @@ static int active_load_balance_cpu_stop(void *data)
 	return 0;
 }
 
-static DEFINE_SPINLOCK(balancing);
+/*
+ * This flag serializes load-balancing passes over large domains
+ * (such as SD_NUMA) - only once load-balancing instance may run
+ * at a time, to reduce overhead on very large systems with lots
+ * of CPUs and large NUMA distances.
+ *
+ * - Note that load-balancing passes triggered while another one
+ *   is executing are skipped and not re-tried.
+ *
+ * - Also note that this does not serialize sched_balance_domains()
+ *   execution, as non-SD_SERIALIZE domains will still be
+ *   load-balanced in parallel.
+ */
+static atomic_t sched_balance_running = ATOMIC_INIT(0);
 
 /*
  * Scale the max load_balance interval with the number of CPUs in the system.
@@ -11711,7 +11724,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 
 		need_serialize = sd->flags & SD_SERIALIZE;
 		if (need_serialize) {
-			if (!spin_trylock(&balancing))
+			if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
 				goto out;
 		}
 
@@ -11729,7 +11742,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 			interval = get_sd_balance_interval(sd, busy);
 		}
 		if (need_serialize)
-			spin_unlock(&balancing);
+			atomic_set_release(&sched_balance_running, 0);
 out:
 		if (time_after(next_balance, sd->last_balance + interval)) {
 			next_balance = sd->last_balance + interval;
-- 
2.40.1

Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag

Posted by Shrikanth Hegde 1 year, 11 months ago


On 3/4/24 3:18 PM, Ingo Molnar wrote:
> The 'balancing' spinlock added in:
> 
>   08c183f31bdb ("[PATCH] sched: add option to serialize load balancing")
> 

[...]

> 
>  
> -static DEFINE_SPINLOCK(balancing);
> +/*
> + * This flag serializes load-balancing passes over large domains
> + * (such as SD_NUMA) - only once load-balancing instance may run
> + * at a time, to reduce overhead on very large systems with lots
> + * of CPUs and large NUMA distances.
> + *
> + * - Note that load-balancing passes triggered while another one
> + *   is executing are skipped and not re-tried.
> + *
> + * - Also note that this does not serialize sched_balance_domains()
> + *   execution, as non-SD_SERIALIZE domains will still be
> + *   load-balanced in parallel.
> + */
> +static atomic_t sched_balance_running = ATOMIC_INIT(0);
>  
>  /*

Continuing the discussion related whether this balancing lock is 
contended or not. 


It was observed in large system (1920CPU, 16 NUMA Nodes) cacheline containing the 
balancing trylock was contended and rebalance_domains was seen as part of the traces. 

So did some experiments on smaller system. This system as 224 CPUs and 6 NUMA nodes.
Added probe points in rebalance_domains. If lock is not contended, then lock should
success and both probe points should match. If not, there should be contention. 
Below are the system details and perf probe -L rebalance_domains.

NUMA:                    
  NUMA node(s):          6
  NUMA node0 CPU(s):     0-31
  NUMA node1 CPU(s):     32-71
  NUMA node4 CPU(s):     72-111
  NUMA node5 CPU(s):     112-151
  NUMA node6 CPU(s):     152-183
  NUMA node7 CPU(s):     184-223


------------------------------------------------------------------------------------------------------------------
#perf probe -L rebalance_domains
<rebalance_domains@/shrikanth/sched_tip/kernel/sched/fair.c:0>
      0  static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
         {
      2         int continue_balancing = 1;
      3         int cpu = rq->cpu;
[...]


     33                 interval = get_sd_balance_interval(sd, busy);

                        need_serialize = sd->flags & SD_SERIALIZE;
     36                 if (need_serialize) {
     37                         if (!spin_trylock(&balancing))
                                        goto out;
                        }

     41                 if (time_after_eq(jiffies, sd->last_balance + interval)) {
     42                         if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {
                                        /*
                                         * The LBF_DST_PINNED logic could have changed
                                         * env->dst_cpu, so we can't know our idle
                                         * state even if we migrated tasks. Update it.
                                         */
     48                                 idle = idle_cpu(cpu) ? CPU_IDLE : CPU_NOT_IDLE;
     49                                 busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
                                }
     51                         sd->last_balance = jiffies;
     52                         interval = get_sd_balance_interval(sd, busy);
                        }
     54                 if (need_serialize)
     55                         spin_unlock(&balancing);
         out:
     57                 if (time_after(next_balance, sd->last_balance + interval)) {
                                next_balance = sd->last_balance + interval;
                                update_next_balance = 1;
                        }
                }

perf probe --list
  probe:rebalance_domains_L37 (on rebalance_domains+856)
  probe:rebalance_domains_L55 (on rebalance_domains+904)
------------------------------------------------------------------------------------------------------------------

Perf records are collected for 10 seconds in different system loads. load is created using stress-ng. 
Contention is calculated as (1-L55/L37)*100

system is idle:  		<--	No contention
1K probe:rebalance_domains_L37
1K probe:rebalance_domains_L55


system is at 25% loa: 		<-- 	4.4% contention
223K probe:rebalance_domains_L37: 1 chunks LOST!
213K probe:rebalance_domains_L55: 1 chunks LOST!



system is at 50% load		<--	12.5% contention
168K probe:rebalance_domains_L37
147K probe:rebalance_domains_L55


system is at 75% load		<-- 	25.6% contention
113K probe:rebalance_domains_L37
84K probe:rebalance_domains_L55

87
system is at 100% load		<--	87.5% contention.
64K probe:rebalance_domains_L37
8K probe:rebalance_domains_L55


A few reasons for contentions could be: 
1. idle load balance is running and some other cpu is becoming idle, and tries newidle_balance. 
2. when system is busy, every CPU would do busy balancing, it would contend for the lock. It will not do balance as 
   should_we_balance says this CPU need not balance. It bails out and release the lock.

Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag

Posted by Ingo Molnar 1 year, 11 months ago

* Shrikanth Hegde <sshegde@linux.ibm.com> wrote:

> system is at 75% load		<-- 	25.6% contention
> 113K probe:rebalance_domains_L37
> 84K probe:rebalance_domains_L55
> 
> 87
> system is at 100% load		<--	87.5% contention.
> 64K probe:rebalance_domains_L37
> 8K probe:rebalance_domains_L55
> 
> 
> A few reasons for contentions could be: 
>
> 1. idle load balance is running and some other cpu is becoming idle, and 
>    tries newidle_balance.
>
> 2. when system is busy, every CPU would do busy balancing, it would 
>    contend for the lock. It will not do balance as should_we_balance says 
>    this CPU need not balance. It bails out and release the lock.

Thanks, these measurements are really useful!

Would it be possible to disambiguate these two cases?

I think we should probably do something about this contention on this large 
system: especially if #2 'no work to be done' bailout is the common case.

Thanks,

	Ingo

Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag

Posted by Shrikanth Hegde 1 year, 11 months ago


On 3/8/24 4:53 PM, Ingo Molnar wrote:
> 
> * Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
> 
>> system is at 75% load		<-- 	25.6% contention
>> 113K probe:rebalance_domains_L37
>> 84K probe:rebalance_domains_L55
>>
>> 87
>> system is at 100% load		<--	87.5% contention.
>> 64K probe:rebalance_domains_L37
>> 8K probe:rebalance_domains_L55
>>
>>
>> A few reasons for contentions could be: 
>>
>> 1. idle load balance is running and some other cpu is becoming idle, and 
>>    tries newidle_balance.
>>
>> 2. when system is busy, every CPU would do busy balancing, it would 
>>    contend for the lock. It will not do balance as should_we_balance says 
>>    this CPU need not balance. It bails out and release the lock.
> 
> Thanks, these measurements are really useful!
> 
> Would it be possible to disambiguate these two cases?

I think its not case 1, since newidle_balance doesnt even take that lock. So 
likely its case 2. 

> 
> I think we should probably do something about this contention on this large 
> system: especially if #2 'no work to be done' bailout is the common case.
> 


I have been thinking would it be right to move this balancing trylock/atomic after 
should_we_balance(swb). This does reduce the number of times this checked/updated 
significantly. Contention is still present. That's possible at higher utilization 
when there are multiple NUMA domains. one CPU in each NUMA domain can contend if their invocation
is aligned. 

That makes sense since, Right now a CPU takes lock, checks if it can balance, do balance if yes and
 then releases the lock. If the lock is taken after swb then also, CPU checks if it can balance, 
tries to take the lock and releases the lock if it did. If lock is contended, it bails out of 
load_balance. That is the current behaviour as well, or I am completely wrong. 

Not sure in which scenarios that would hurt. we could do this after this series. 
This may need wider functional testing to make sure we don't regress badly in some cases. 
This is only an *idea* as of now. 

Perf probes at spin_trylock and spin_unlock codepoints on the same 224CPU, 6 NUMA node system. 
6.8-rc6                                                                         
-----------------------------------------                                       
idle system:                                                                    
449 probe:rebalance_domains_L37                                                 
377 probe:rebalance_domains_L55                                                 
stress-ng --cpu=$(nproc) -l 51     << 51% load                                               
88K probe:rebalance_domains_L37                                                 
77K probe:rebalance_domains_L55                                                 
stress-ng --cpu=$(nproc) -l 100    << 100% load                                             
41K probe:rebalance_domains_L37                                                 
10K probe:rebalance_domains_L55                                                 
                                                                                
+below patch                                                                          
----------------------------------------                                        
idle system:                                                                    
462 probe:load_balance_L35                                                      
394 probe:load_balance_L274                                                     
stress-ng --cpu=$(nproc) -l 51      << 51% load                                            
5K probe:load_balance_L35                       	<<-- almost 15x less                                
4K probe:load_balance_L274                                                      
stress-ng --cpu=$(nproc) -l 100     << 100% load                                            
8K probe:load_balance_L35                                                       
3K probe:load_balance_L274 				<<-- almost 4x less


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 62f247bdec86..3a8de7454377 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11272,6 +11272,7 @@ static int should_we_balance(struct lb_env *env)
 	return group_balance_cpu(sg) == env->dst_cpu;
 }

+static DEFINE_SPINLOCK(balancing);
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
@@ -11286,6 +11287,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	struct rq *busiest;
 	struct rq_flags rf;
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+	int need_serialize;
 	struct lb_env env = {
 		.sd		= sd,
 		.dst_cpu	= this_cpu,
@@ -11308,6 +11310,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		goto out_balanced;
 	}

+	need_serialize = sd->flags & SD_SERIALIZE;
+	if (need_serialize) {
+		if (!spin_trylock(&balancing))
+			goto lockout;
+	}
+
 	group = find_busiest_group(&env);
 	if (!group) {
 		schedstat_inc(sd->lb_nobusyg[idle]);
@@ -11434,6 +11442,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			if (!cpumask_subset(cpus, env.dst_grpmask)) {
 				env.loop = 0;
 				env.loop_break = SCHED_NR_MIGRATE_BREAK;
+				if (need_serialize)
+					spin_unlock(&balancing);
 				goto redo;
 			}
 			goto out_all_pinned;
@@ -11540,7 +11550,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	     sd->balance_interval < MAX_PINNED_INTERVAL) ||
 	    sd->balance_interval < sd->max_interval)
 		sd->balance_interval *= 2;
+
 out:
+	if (need_serialize)
+		spin_unlock(&balancing);
+
+lockout:
 	return ld_moved;
 }

@@ -11665,7 +11680,6 @@ static int active_load_balance_cpu_stop(void *data)
 	return 0;
 }

-static DEFINE_SPINLOCK(balancing);

 /*
  * Scale the max load_balance interval with the number of CPUs in the system.
@@ -11716,7 +11730,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 	/* Earliest time when we have to do rebalance again */
 	unsigned long next_balance = jiffies + 60*HZ;
 	int update_next_balance = 0;
-	int need_serialize, need_decay = 0;
+	int need_decay = 0;
 	u64 max_cost = 0;

 	rcu_read_lock();
@@ -11741,12 +11755,6 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)

 		interval = get_sd_balance_interval(sd, busy);

-		need_serialize = sd->flags & SD_SERIALIZE;
-		if (need_serialize) {
-			if (!spin_trylock(&balancing))
-				goto out;
-		}
-
 		if (time_after_eq(jiffies, sd->last_balance + interval)) {
 			if (load_balance(cpu, rq, sd, idle, &continue_balancing)) {
 				/*
@@ -11760,9 +11768,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 			sd->last_balance = jiffies;
 			interval = get_sd_balance_interval(sd, busy);
 		}
-		if (need_serialize)
-			spin_unlock(&balancing);
-out:
+
 		if (time_after(next_balance, sd->last_balance + interval)) {
 			next_balance = sd->last_balance + interval;
 			update_next_balance = 1;

Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag

Posted by Ingo Molnar 1 year, 11 months ago

* Shrikanth Hegde <sshegde@linux.ibm.com> wrote:

> > I think we should probably do something about this contention on this 
> > large system: especially if #2 'no work to be done' bailout is the 
> > common case.
> 
> 
> I have been thinking would it be right to move this balancing 
> trylock/atomic after should_we_balance(swb). This does reduce the number 
> of times this checked/updated significantly. Contention is still present. 
> That's possible at higher utilization when there are multiple NUMA 
> domains. one CPU in each NUMA domain can contend if their invocation is 
> aligned.

Note that it's not true contention: it simply means there's overlapping 
requests for the highest domains to be balanced, for which we only have a 
single thread of execution at a time, system-wide.

> That makes sense since, Right now a CPU takes lock, checks if it can 
>  balance, do balance if yes and then releases the lock. If the lock is 
>  taken after swb then also, CPU checks if it can balance,
> tries to take the lock and releases the lock if it did. If lock is 
> contended, it bails out of load_balance. That is the current behaviour as 
> well, or I am completely wrong.
> 
> Not sure in which scenarios that would hurt. we could do this after this 
> series. This may need wider functional testing to make sure we don't 
> regress badly in some cases. This is only an *idea* as of now.
> 
> Perf probes at spin_trylock and spin_unlock codepoints on the same 224CPU, 6 NUMA node system. 
> 6.8-rc6                                                                         
> -----------------------------------------                                       
> idle system:                                                                    
> 449 probe:rebalance_domains_L37                                                 
> 377 probe:rebalance_domains_L55                                                 
> stress-ng --cpu=$(nproc) -l 51     << 51% load                                               
> 88K probe:rebalance_domains_L37                                                 
> 77K probe:rebalance_domains_L55                                                 
> stress-ng --cpu=$(nproc) -l 100    << 100% load                                             
> 41K probe:rebalance_domains_L37                                                 
> 10K probe:rebalance_domains_L55                                                 
>                                                                                 
> +below patch                                                                          
> ----------------------------------------                                        
> idle system:                                                                    
> 462 probe:load_balance_L35                                                      
> 394 probe:load_balance_L274                                                     
> stress-ng --cpu=$(nproc) -l 51      << 51% load                                            
> 5K probe:load_balance_L35                       	<<-- almost 15x less                                
> 4K probe:load_balance_L274                                                      
> stress-ng --cpu=$(nproc) -l 100     << 100% load                                            
> 8K probe:load_balance_L35                                                       
> 3K probe:load_balance_L274 				<<-- almost 4x less

That's nice.

> +static DEFINE_SPINLOCK(balancing);
>  /*
>   * Check this_cpu to ensure it is balanced within domain. Attempt to move
>   * tasks if there is an imbalance.
> @@ -11286,6 +11287,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  	struct rq *busiest;
>  	struct rq_flags rf;
>  	struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
> +	int need_serialize;
>  	struct lb_env env = {
>  		.sd		= sd,
>  		.dst_cpu	= this_cpu,
> @@ -11308,6 +11310,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  		goto out_balanced;
>  	}
> 
> +	need_serialize = sd->flags & SD_SERIALIZE;
> +	if (need_serialize) {
> +		if (!spin_trylock(&balancing))
> +			goto lockout;
> +	}
> +
>  	group = find_busiest_group(&env);

So if I'm reading your patch right, the main difference appears to be that 
it allows the should_we_balance() check to be executed in parallel, and 
will only try to take the NUMA-balancing flag if that function indicates an 
imbalance.

Since should_we_balance() isn't taking any locks AFAICS, this might be a 
valid approach. What might make sense is to instrument the percentage of 
NUMA-balancing flag-taking 'failures' vs. successful attempts - not 
necessarily the 'contention percentage'.

But another question is, why do we get here so frequently, so that the 
cumulative execution time of these SD_SERIAL rebalance passes exceeds that 
of 100% of single CPU time? Ie. a single CPU is basically continuously 
scanning the scheduler data structures for imbalances, right? That doesn't 
seem natural even with just ~224 CPUs.

Alternatively, is perhaps the execution time of the SD_SERIAL pass so large 
that we exceed 100% CPU time?

Thanks,

	Ingo

Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag

Posted by Shrikanth Hegde 1 year, 10 months ago


On 3/12/24 4:27 PM, Ingo Molnar wrote:
> 
> * Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
> 
>> I have been thinking would it be right to move this balancing 
>> trylock/atomic after should_we_balance(swb). This does reduce the number 
>> of times this checked/updated significantly. Contention is still present. 
>> That's possible at higher utilization when there are multiple NUMA 
>> domains. one CPU in each NUMA domain can contend if their invocation is 
>> aligned.
> 
> Note that it's not true contention: it simply means there's overlapping 
> requests for the highest domains to be balanced, for which we only have a 
> single thread of execution at a time, system-wide.
> 

Hi Ingo, Vincent, Peter,

I did some more experiments and collected quite a bit of data. 
Some interesting observations.

tl;dr

1. For BUSY and IDLE type load balancing, contention arises since multiple CPU's
are trying to set the flag at the same time. Only one succeeds and tries to do 
load balance. Whether it can do load balance or not is decided later in should_we_balance(swb). 
contention increases with system utilization. Why contention happens is simpler for BUSY, if their 
SOFTIRQ's are aligned. If the busy is invoked(but may prefer idle) while an IDLE CPU is trying is 
scenario when utilization is low. 

2. NEWIDLE does not even check for sched_balance_running and it does not honor 
continue_balancing flag either. should_we_balance for NEWIDLE is mostly return true anyway. 
So NEWIDLE balance doesn't honor the SD_SERIALIZE flag. It does go through all the rq even at the 
highest NUMA Domains often, and pulls task opportunistically. cost of balancing at NUMA is costly 
as it would need to fetch all rq.

>> That makes sense since, Right now a CPU takes lock, checks if it can 
>>  balance, do balance if yes and then releases the lock. If the lock is 
>>  taken after swb then also, CPU checks if it can balance,
>> tries to take the lock and releases the lock if it did. If lock is 
>> contended, it bails out of load_balance. That is the current behaviour as 
>> well, or I am completely wrong.
>>
>> Perf probes at spin_trylock and spin_unlock codepoints on the same 224CPU, 6 NUMA node system. 
>> 6.8-rc6                                                                         
>> -----------------------------------------                                       
>> idle system:                                                                    
>> 449 probe:rebalance_domains_L37                                                 
>> 377 probe:rebalance_domains_L55                                                 
>> stress-ng --cpu=$(nproc) -l 51     << 51% load                                               
>> 88K probe:rebalance_domains_L37                                                 
>> 77K probe:rebalance_domains_L55                                                 
>> stress-ng --cpu=$(nproc) -l 100    << 100% load                                             
>> 41K probe:rebalance_domains_L37                                                 
>> 10K probe:rebalance_domains_L55                                                 
>>                                                                                 
>> +below patch                                                                          
>> ----------------------------------------                                        
>> idle system:                                                                    
>> 462 probe:load_balance_L35                                                      
>> 394 probe:load_balance_L274                                                     
>> stress-ng --cpu=$(nproc) -l 51      << 51% load                                            
>> 5K probe:load_balance_L35                       	<<-- almost 15x less                                
>> 4K probe:load_balance_L274                                                      
>> stress-ng --cpu=$(nproc) -l 100     << 100% load                                            
>> 8K probe:load_balance_L35                                                       
>> 3K probe:load_balance_L274 				<<-- almost 4x less
> 
> That's nice.
> 

These numbers completely go crazy for schbench or hackbench if we move the sched_balance_running 
after should_we_balance due to NEWIDLE. we could add a idle type check before doing operations on
sched_balance_running.   


>> +static DEFINE_SPINLOCK(balancing);
>>  /*
>>   * Check this_cpu to ensure it is balanced within domain. Attempt to move
>>   * tasks if there is an imbalance.
>> @@ -11286,6 +11287,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>  	struct rq *busiest;
>>  	struct rq_flags rf;
>>  	struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
>> +	int need_serialize;
>>  	struct lb_env env = {
>>  		.sd		= sd,
>>  		.dst_cpu	= this_cpu,
>> @@ -11308,6 +11310,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>  		goto out_balanced;
>>  	}
>>
>> +	need_serialize = sd->flags & SD_SERIALIZE;
>> +	if (need_serialize) {
>> +		if (!spin_trylock(&balancing))
>> +			goto lockout;
>> +	}
>> +
>>  	group = find_busiest_group(&env);
> 
> So if I'm reading your patch right, the main difference appears to be that 
> it allows the should_we_balance() check to be executed in parallel, and 
> will only try to take the NUMA-balancing flag if that function indicates an 
> imbalance.

Yes for BUSY and IDLE balancing. 

> 
> Since should_we_balance() isn't taking any locks AFAICS, this might be a 
> valid approach. What might make sense is to instrument the percentage of 
> NUMA-balancing flag-taking 'failures' vs. successful attempts - not 
> necessarily the 'contention percentage'.

So i put a bunch of trace_printk and added a hacky check similar to swb if taking flag fails. 
I will put the numbers for schebench, hackbench and stress-ng -l 100 for initial reference. 

I ran these a slightly larger system this time.  I have two NUMA levels. domain 3 has two groups 0-159, 160-319.
Sorry in case the data isnt easy to read. I couldnt find a better way. some rounding errors are there, but mostly 
sum where applicable is close. trace is collected for 5 seconds.

  NUMA node(s):           4
  NUMA node0 CPU(s):      0-79
  NUMA node1 CPU(s):      80-159
  NUMA node6 CPU(s):      160-239
  NUMA node7 CPU(s):      240-319

more domain*/flags | cat
::::::::::::::
domain0/flags
::::::::::::::
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING 
::::::::::::::
domain1/flags
::::::::::::::
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING 
::::::::::::::
domain2/flags
::::::::::::::
SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING 
::::::::::::::
domain3/flags
::::::::::::::
SD_BALANCE_NEWIDLE SD_SERIALIZE SD_OVERLAP SD_NUMA 
::::::::::::::
domain4/flags
::::::::::::::
SD_BALANCE_NEWIDLE SD_SERIALIZE SD_OVERLAP SD_NUMA 

================================================================= BUSY balancing ===============================================
		   tried_access_to	|	  	failed			|		aquired(number of sched_balance_rq)  <<-- aquired need not call sched_balance_rq due to time check.
		sched_balance_running	|		to_aquire 		|
					|     swb true?		swb_false?    	|      swb_true?    swb_false?
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
					|    					|	
hackbench   	       254727		|		236268			|		18508(170)		
10 groups 				|	38     <--|-->	236230		| 	0	<--|-->	170      
					|					|
schbench	       217794		|		207117			|		10683(659)
320 groups				| 	1059   <--|-->  206058 		|	1	<--|-->  658
					|					|
stress-ng 		31646		|		 27430			|		4216(523)	
-l100					|	272	<--|-->	 27158		|	6	<--|-->	517
					|					|
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Observations: 
  * When the flag was not aquired it was mostly that CPU wouldnt have done the balancing. Good.
  * When the flag was aquired, mostly that CPU didn't do load balancing. It bailed out since swb was false. Bad. 


=============================================================== IDLE Balancing ==================================================
		   tried_access_to	|	  	failed			|		aquired(number of sched_balance_rq)  <<-- aquired need not call sched_balance_rq due to time check.
		sched_balance_running	|		to_aquire 		|
					|     swb true?		swb_false?    	|      swb_true?    swb_false?
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

hackbench		11022		| 		7309			|		 3664(2635)
10 groups				|	1322   <--|-->  5987		| 	753	<--|-->	1882 
					|					|
schbench		 3018		|		2305			|		  707(632)
320 groups				|	1036   <--|-->	1269		|	268     <--|--> 364
					|					|
stress-ng 		    0		|		   0			|		    0
-l 100					|					|
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Observations: 
 * When the flag was not aquired, it was good chance that CPU would have done the balance.  bad
 * When the flag is taken, many times it didnt do balance since swb was false.  bad. 

> 
> But another question is, why do we get here so frequently, so that the 
> cumulative execution time of these SD_SERIAL rebalance passes exceeds that 
> of 100% of single CPU time? Ie. a single CPU is basically continuously 
> scanning the scheduler data structures for imbalances, right? That doesn't 
> seem natural even with just ~224 CPUs.
> 

SD_SERIALIZE passes are costly. Doing a pass through 0-319 takes 20-30us and Doing 
a pass through 0-159 takes around 10-20us.  NEWIDLE can be called very often and it 
doesnt serialize and loops through all domains aggressively.  NEWIDLE balancing 
successfully pull a task around 3-5% of the time. 


For reference, the running the same hackbench, Combined it would take 12 seconds on 320 CPU system 
where trace is collected for 5 seconds. Thats approx 0.7% cycles. 
domain_cost avg: 33031.77 times: 172991  
domain_cost avg: 15615.21 times: 211783
domain_cost avg: 15359.40 times: 208164

Similarly while running schbench, it takes approx 0.12% of total time. 

> Alternatively, is perhaps the execution time of the SD_SERIAL pass so large 
> that we exceed 100% CPU time?
> 

So Main questions i have are: 
1. Should NEWIDLE honor SD_SERIALIZE ? This would make SD_SERIALIZE defined correctly, 
all three types honor it. Currenly on BUSY,IDLE honors it.
2. Should we move taking sched_balance_running after swb atleast for BUSY, IDLE? 

================================================================================================


debug patch used: 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c8e50fbac345..cb824c327ab6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11172,6 +11172,56 @@ static int need_active_balance(struct lb_env *env)
 
 static int active_load_balance_cpu_stop(void *data);
 
+static int hack_check_if_i_can_balance(struct sched_domain *sd, int this_cpu) {
+	struct cpumask *swb_cpus = this_cpu_cpumask_var_ptr(should_we_balance_tmpmask);
+	struct sched_group *sg = sd->groups;
+	int cpu, idle_smt = -1;
+	struct cpumask *cpus = this_cpu_cpumask_var_ptr(load_balance_mask);
+
+	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
+	if (!cpumask_test_cpu(this_cpu, cpus))
+		return 0;
+
+	cpumask_copy(swb_cpus, group_balance_mask(sg));
+	for_each_cpu_and(cpu, swb_cpus, cpus) {
+		if (!idle_cpu(cpu))
+			continue;
+
+		/*
+		 * Don't balance to idle SMT in busy core right away when
+		 * balancing cores, but remember the first idle SMT CPU for
+		 * later consideration.  Find CPU on an idle core first.
+		 */
+		if (!(sd->flags & SD_SHARE_CPUCAPACITY) && !is_core_idle(cpu)) {
+			if (idle_smt == -1)
+				idle_smt = cpu;
+			/*
+			 * If the core is not idle, and first SMT sibling which is
+			 * idle has been found, then its not needed to check other
+			 * SMT siblings for idleness:
+			 */
+#ifdef CONFIG_SCHED_SMT
+			cpumask_andnot(swb_cpus, swb_cpus, cpu_smt_mask(cpu));
+#endif
+			continue;
+		}
+
+		/*
+		 * Are we the first idle core in a non-SMT domain or higher,
+		 * or the first idle CPU in a SMT domain?
+		 */
+		return cpu == this_cpu;
+	}
+
+	/* Are we the first idle CPU with busy siblings? */
+	if (idle_smt != -1)
+		return idle_smt == this_cpu;
+
+	/* Are we the first CPU of this group ? */
+	return group_balance_cpu(sg) == this_cpu;
+}
+
+
 static int should_we_balance(struct lb_env *env)
 {
 	struct cpumask *swb_cpus = this_cpu_cpumask_var_ptr(should_we_balance_tmpmask);
@@ -11267,13 +11317,22 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
 
 	schedstat_inc(sd->lb_count[idle]);
+	if (sd->flags & SD_SERIALIZE)
+		trace_printk("Trying load_balance_rq for idle: %d span: %*pbl\n", idle,cpumask_pr_args(sched_domain_span(sd)));
 
 redo:
 	if (!should_we_balance(&env)) {
 		*continue_balancing = 0;
+		if (sd->flags & SD_SERIALIZE) {
+			trace_printk("swb says this cpu doesnt need to balance idle: %d span: %*pbl\n", idle, cpumask_pr_args(sched_domain_span(sd)));
+		}
 		goto out_balanced;
 	}
 
+	if (sd->flags & SD_SERIALIZE) {
+		trace_printk("This cpu would try balance idle: %d span: %*pbl\n", idle, cpumask_pr_args(sched_domain_span(sd)));
+	}
+
 	group = sched_balance_find_src_group(&env);
 	if (!group) {
 		schedstat_inc(sd->lb_nobusyg[idle]);
@@ -11288,6 +11347,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 
 	WARN_ON_ONCE(busiest == env.dst_rq);
 
+	if (sd->flags & SD_SERIALIZE) {
+		trace_printk("There is some imbalance idle: %d span: %*pbl\n", idle, cpumask_pr_args(sched_domain_span(sd)));
+	}
 	schedstat_add(sd->lb_imbalance[idle], env.imbalance);
 
 	env.src_cpu = busiest->cpu;
@@ -11507,6 +11569,8 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 	    sd->balance_interval < sd->max_interval)
 		sd->balance_interval *= 2;
 out:
+	if ((sd->flags & SD_SERIALIZE) && ld_moved)
+		trace_printk("load balance was successful idle: %d span: %*pbl\n", idle, cpumask_pr_args(sched_domain_span(sd)));
 	return ld_moved;
 }
 
@@ -11722,8 +11786,13 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 
 		need_serialize = sd->flags & SD_SERIALIZE;
 		if (need_serialize) {
-			if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
+			int result;
+			trace_printk("try to set sched_balance_running idle: %d\n", idle);
+			if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
+				result = hack_check_if_i_can_balance(sd, cpu);
+				trace_printk("sched_balance_running  was not aquird idle: %d was_swb: %d\n", idle, result);
 				goto out;
+			}
 		}
 
 		if (time_after_eq(jiffies, sd->last_balance + interval)) {
@@ -11739,8 +11808,10 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 			sd->last_balance = jiffies;
 			interval = get_sd_balance_interval(sd, busy);
 		}
-		if (need_serialize)
+		if (need_serialize) {
+			trace_printk("Release sched_balance_running idle: %d\n", idle);
 			atomic_set_release(&sched_balance_running, 0);
+		}
 out:
 		if (time_after(next_balance, sd->last_balance + interval)) {
 			next_balance = sd->last_balance + interval;
@@ -12347,6 +12418,10 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 		u64 domain_cost;
 
 		update_next_balance(sd, &next_balance);
+		if (sd->flags & SD_SERIALIZE) {
+			trace_printk("newidle balance called span: %*pbl,  rq_avg: %llu, curr_cost: %llu, max_idle_cost: %llu\n",
+					cpumask_pr_args(sched_domain_span(sd)), this_rq->avg_idle, curr_cost, sd->max_newidle_lb_cost);
+		}
 
 		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
 			break;
@@ -12360,6 +12435,9 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 			t1 = sched_clock_cpu(this_cpu);
 			domain_cost = t1 - t0;
 			update_newidle_cost(sd, domain_cost);
+			if (sd->flags & SD_SERIALIZE)
+				trace_printk("cost of load balance,span: %*pbl,  pulled_task: %d, domain_cost: %llu\n",
+						cpumask_pr_args(sched_domain_span(sd)), pulled_task, domain_cost);
 
 			curr_cost += domain_cost;
 			t0 = t1;

Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag

Posted by Valentin Schneider 1 year, 11 months ago

On 04/03/24 10:48, Ingo Molnar wrote:
> The 'balancing' spinlock added in:
>
>   08c183f31bdb ("[PATCH] sched: add option to serialize load balancing")
>
> ... is taken when the SD_SERIALIZE flag is set in a domain, but in reality it
> is a glorified global atomic flag serializing the load-balancing of
> those domains.
>
> It doesn't have any explicit locking semantics per se: we just
> spin_trylock() it.
>
> Turn it into a ... global atomic flag. This makes it more
> clear what is going on here, and reduces overhead and code
> size a bit:
>
>   # kernel/sched/fair.o: [x86-64 defconfig]
>
>      text	   data	    bss	    dec	    hex	filename
>     60730	   2721	    104	  63555	   f843	fair.o.before
>     60718	   2721	    104	  63543	   f837	fair.o.after
>
> Also document the flag a bit.
>
> No change in functionality intended.
>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Valentin Schneider <vschneid@redhat.com>

Few comment nits, otherwise:

Reviewed-by: Valentin Schneider <vschneid@redhat.com>

> ---
>  kernel/sched/fair.c | 19 ++++++++++++++++---
>  1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6a16129f9a5c..64ae3d8dc93b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11633,7 +11633,20 @@ static int active_load_balance_cpu_stop(void *data)
>       return 0;
>  }
>
> -static DEFINE_SPINLOCK(balancing);
> +/*
> + * This flag serializes load-balancing passes over large domains
> + * (such as SD_NUMA) - only once load-balancing instance may run
                               ^^^^
                           s/once/one/

Also, currently the flag is only set for domains above the NODE topology
level, sd_init() will reject an architecture that forces SD_SERIALIZE in a
topology level's ->sd_flags(), so what about:

s/(such as SD_NUMA)/(above the NODE topology level)

> + * at a time, to reduce overhead on very large systems with lots
> + * of CPUs and large NUMA distances.
> + *
> + * - Note that load-balancing passes triggered while another one
> + *   is executing are skipped and not re-tried.
> + *
> + * - Also note that this does not serialize sched_balance_domains()
                                               ^^^^^^^^^^^^^^^^^^^^^
Did you mean rebalance_domains()?

> + *   execution, as non-SD_SERIALIZE domains will still be
> + *   load-balanced in parallel.
> + */
> +static atomic_t sched_balance_running = ATOMIC_INIT(0);
>
>  /*
>   * Scale the max load_balance interval with the number of CPUs in the system.
> @@ -11711,7 +11724,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>
>               need_serialize = sd->flags & SD_SERIALIZE;
>               if (need_serialize) {
> -			if (!spin_trylock(&balancing))
> +			if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
>                               goto out;
>               }
>
> @@ -11729,7 +11742,7 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>                       interval = get_sd_balance_interval(sd, busy);
>               }
>               if (need_serialize)
> -			spin_unlock(&balancing);
> +			atomic_set_release(&sched_balance_running, 0);
>  out:
>               if (time_after(next_balance, sd->last_balance + interval)) {
>                       next_balance = sd->last_balance + interval;
> --
> 2.40.1

Re: [PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag

Posted by Ingo Molnar 1 year, 11 months ago

* Valentin Schneider <vschneid@redhat.com> wrote:

> On 04/03/24 10:48, Ingo Molnar wrote:
> > The 'balancing' spinlock added in:
> >
> >   08c183f31bdb ("[PATCH] sched: add option to serialize load balancing")
> >
> > ... is taken when the SD_SERIALIZE flag is set in a domain, but in reality it
> > is a glorified global atomic flag serializing the load-balancing of
> > those domains.
> >
> > It doesn't have any explicit locking semantics per se: we just
> > spin_trylock() it.
> >
> > Turn it into a ... global atomic flag. This makes it more
> > clear what is going on here, and reduces overhead and code
> > size a bit:
> >
> >   # kernel/sched/fair.o: [x86-64 defconfig]
> >
> >      text	   data	    bss	    dec	    hex	filename
> >     60730	   2721	    104	  63555	   f843	fair.o.before
> >     60718	   2721	    104	  63543	   f837	fair.o.after
> >
> > Also document the flag a bit.
> >
> > No change in functionality intended.
> >
> > Signed-off-by: Ingo Molnar <mingo@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Vincent Guittot <vincent.guittot@linaro.org>
> > Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Valentin Schneider <vschneid@redhat.com>
> 
> Few comment nits, otherwise:
> 
> Reviewed-by: Valentin Schneider <vschneid@redhat.com>

Thanks!

> > -static DEFINE_SPINLOCK(balancing);
> > +/*
> > + * This flag serializes load-balancing passes over large domains
> > + * (such as SD_NUMA) - only once load-balancing instance may run
>                                ^^^^
>                            s/once/one/
> 
> Also, currently the flag is only set for domains above the NODE topology
> level, sd_init() will reject an architecture that forces SD_SERIALIZE in a
> topology level's ->sd_flags(), so what about:
> 
> s/(such as SD_NUMA)/(above the NODE topology level)

Agreed & done.

> > + * at a time, to reduce overhead on very large systems with lots
> > + * of CPUs and large NUMA distances.
> > + *
> > + * - Note that load-balancing passes triggered while another one
> > + *   is executing are skipped and not re-tried.
> > + *
> > + * - Also note that this does not serialize sched_balance_domains()
>                                                ^^^^^^^^^^^^^^^^^^^^^
> Did you mean rebalance_domains()?

Correct, a later rename that unifies the nomenclature of all the 
rebalancing functions along the sched_balance_*() prefix that I have not 
posted yet crept back into this comment block, but obviously this patch 
should refer to the current namespace. Fixed.

Thanks,

	Ingo

[PATCH 1/9] sched/balancing: Switch the 'DEFINE_SPINLOCK(balancing)' spinlock into an 'atomic_t sched_balance_running' flag
[PATCH 2/9] sched/balancing: Remove reliance on 'enum cpu_idle_type' ordering when iterating [CPU_MAX_IDLE_TYPES] arrays in show_schedstat()
[PATCH 3/9] sched/balancing: Change 'enum cpu_idle_type' to have more natural definitions
[PATCH 4/9] sched/balancing: Change comment formatting to not overlap Git conflict marker lines
[PATCH 5/9] sched/balancing: Fix comments (trying to) refer to NOHZ_BALANCE_KICK
[PATCH 6/9] sched/balancing: Update run_rebalance_domains() comments
[PATCH 7/9] sched/balancing: Vertically align the comments of 'struct sg_lb_stats' and 'struct sd_lb_stats'
[PATCH 8/9] sched/balancing: Update comments in 'struct sg_lb_stats' and 'struct sd_lb_stats'
[PATCH 9/9] sched/balancing: Rename run_rebalance_domains() => sched_balance_softirq()