[PATCH 1/2] sched: Optimise task_mm_cid_work duration

Gabriele Monaco posted 2 patches 1 year, 2 months ago
Posted by Gabriele Monaco 1 year, 2 months ago
The current behaviour of task_mm_cid_work is to loop through all
possible CPUs twice to clean up old mm_cid values remotely; this can
waste resources, especially for tasks with a restricted CPU affinity.

This patch reduces the number of CPUs involved in the remote CID
cleanup carried out by task_mm_cid_work.

Using the mm_cidmask for the remote cleanup can considerably reduce
the function's runtime in highly isolated environments, where each
process has affinity to a single core.  In the worst case, the mask is
equivalent to all possible CPUs and the behaviour is unchanged from
the current one.

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 kernel/sched/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 95e40895a519..57b50b5952fa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10553,14 +10553,14 @@ static void task_mm_cid_work(struct callback_head *work)
 		return;
 	cidmask = mm_cidmask(mm);
 	/* Clear cids that were not recently used. */
-	for_each_possible_cpu(cpu)
+	for_each_cpu_from(cpu, cidmask)
 		sched_mm_cid_remote_clear_old(mm, cpu);
 	weight = cpumask_weight(cidmask);
 	/*
 	 * Clear cids that are greater or equal to the cidmask weight to
 	 * recompact it.
 	 */
-	for_each_possible_cpu(cpu)
+	for_each_cpu_from(cpu, cidmask)
 		sched_mm_cid_remote_clear_weight(mm, cpu, weight);
 }
 
-- 
2.47.0
Re: [PATCH 1/2] sched: Optimise task_mm_cid_work duration
Posted by kernel test robot 1 year, 2 months ago
Hi Gabriele,

kernel test robot noticed the following build warnings:

[auto build test WARNING on e70140ba0d2b1a30467d4af6bcfe761327b9ec95]

url:    https://github.com/intel-lab-lkp/linux/commits/Gabriele-Monaco/sched-Optimise-task_mm_cid_work-duration/20241203-003033
base:   e70140ba0d2b1a30467d4af6bcfe761327b9ec95
patch link:    https://lore.kernel.org/r/20241202140735.56368-2-gmonaco%40redhat.com
patch subject: [PATCH 1/2] sched: Optimise task_mm_cid_work duration
config: arm64-randconfig-004-20241203 (https://download.01.org/0day-ci/archive/20241203/202412031029.B7lLh63F-lkp@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 592c0fe55f6d9a811028b5f3507be91458ab2713)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241203/202412031029.B7lLh63F-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412031029.B7lLh63F-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from kernel/sched/core.c:10:
   In file included from include/linux/highmem.h:8:
   In file included from include/linux/cacheflush.h:5:
   In file included from arch/arm64/include/asm/cacheflush.h:11:
   In file included from include/linux/kgdb.h:19:
   In file included from include/linux/kprobes.h:28:
   In file included from include/linux/ftrace.h:13:
   In file included from include/linux/kallsyms.h:13:
   In file included from include/linux/mm.h:2223:
   include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     518 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
>> kernel/sched/core.c:10552:6: warning: variable 'cpu' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
    10552 |         if (!try_cmpxchg(&mm->mm_cid_next_scan, &old_scan, next_scan))
          |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/sched/core.c:10556:20: note: uninitialized use occurs here
    10556 |         for_each_cpu_from(cpu, cidmask)
          |                           ^~~
   include/linux/cpumask.h:391:24: note: expanded from macro 'for_each_cpu_from'
     391 |         for_each_set_bit_from(cpu, cpumask_bits(mask), small_cpumask_bits)
         |                               ^~~
   include/linux/find.h:605:48: note: expanded from macro 'for_each_set_bit_from'
     605 |         for (; (bit) = find_next_bit((addr), (size), (bit)), (bit) < (size); (bit)++)
         |                                                       ^~~
   kernel/sched/core.c:10552:2: note: remove the 'if' if its condition is always true
    10552 |         if (!try_cmpxchg(&mm->mm_cid_next_scan, &old_scan, next_scan))
          |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    10553 |                 return;
   kernel/sched/core.c:10529:17: note: initialize the variable 'cpu' to silence this warning
    10529 |         int weight, cpu;
          |                        ^
          |                         = 0
   kernel/sched/core.c:1809:1: warning: unused function 'uclamp_update_active' [-Wunused-function]
    1809 | uclamp_update_active(struct task_struct *p)
         | ^~~~~~~~~~~~~~~~~~~~
   kernel/sched/core.c:6430:1: warning: unused function 'class_core_lock_lock_ptr' [-Wunused-function]
    6430 | DEFINE_LOCK_GUARD_1(core_lock, int,
         | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    6431 |                     sched_core_lock(*_T->lock, &_T->flags),
         |                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    6432 |                     sched_core_unlock(*_T->lock, &_T->flags),
         |                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    6433 |                     unsigned long flags)
         |                     ~~~~~~~~~~~~~~~~~~~~
   include/linux/cleanup.h:416:49: note: expanded from macro 'DEFINE_LOCK_GUARD_1'
     416 | __DEFINE_CLASS_IS_CONDITIONAL(_name, false);                            \
         |                                                                         ^
     417 | __DEFINE_UNLOCK_GUARD(_name, _type, _unlock, __VA_ARGS__)               \
         | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/cleanup.h:392:21: note: expanded from macro '\
   __DEFINE_UNLOCK_GUARD'
     392 | static inline void *class_##_name##_lock_ptr(class_##_name##_t *_T)     \
         |                     ^~~~~~~~~~~~~~~~~~~~~~~~
   <scratch space>:63:1: note: expanded from here
      63 | class_core_lock_lock_ptr
         | ^~~~~~~~~~~~~~~~~~~~~~~~
   4 warnings generated.


vim +10552 kernel/sched/core.c

223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10522  
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10523  static void task_mm_cid_work(struct callback_head *work)
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10524  {
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10525  	unsigned long now = jiffies, old_scan, next_scan;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10526  	struct task_struct *t = current;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10527  	struct cpumask *cidmask;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10528  	struct mm_struct *mm;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10529  	int weight, cpu;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10530  
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10531  	SCHED_WARN_ON(t != container_of(work, struct task_struct, cid_work));
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10532  
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10533  	work->next = work;	/* Prevent double-add */
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10534  	if (t->flags & PF_EXITING)
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10535  		return;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10536  	mm = t->mm;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10537  	if (!mm)
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10538  		return;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10539  	old_scan = READ_ONCE(mm->mm_cid_next_scan);
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10540  	next_scan = now + msecs_to_jiffies(MM_CID_SCAN_DELAY);
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10541  	if (!old_scan) {
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10542  		unsigned long res;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10543  
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10544  		res = cmpxchg(&mm->mm_cid_next_scan, old_scan, next_scan);
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10545  		if (res != old_scan)
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10546  			old_scan = res;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10547  		else
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10548  			old_scan = next_scan;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10549  	}
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10550  	if (time_before(now, old_scan))
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10551  		return;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 @10552  	if (!try_cmpxchg(&mm->mm_cid_next_scan, &old_scan, next_scan))
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10553  		return;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10554  	cidmask = mm_cidmask(mm);
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10555  	/* Clear cids that were not recently used. */
2909dedb7586e2c Gabriele Monaco   2024-12-02  10556  	for_each_cpu_from(cpu, cidmask)
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10557  		sched_mm_cid_remote_clear_old(mm, cpu);
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10558  	weight = cpumask_weight(cidmask);
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10559  	/*
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10560  	 * Clear cids that are greater or equal to the cidmask weight to
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10561  	 * recompact it.
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10562  	 */
2909dedb7586e2c Gabriele Monaco   2024-12-02  10563  	for_each_cpu_from(cpu, cidmask)
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10564  		sched_mm_cid_remote_clear_weight(mm, cpu, weight);
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10565  }
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20  10566  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Re: [PATCH 1/2] sched: Optimise task_mm_cid_work duration
Posted by Mathieu Desnoyers 1 year, 2 months ago
On 2024-12-02 09:07, Gabriele Monaco wrote:
> The current behaviour of task_mm_cid_work is to loop through all
> possible CPUs twice to clean up old mm_cid values remotely; this can
> waste resources, especially for tasks with a restricted CPU affinity.
> 
> This patch reduces the number of CPUs involved in the remote CID
> cleanup carried out by task_mm_cid_work.
> 
> Using the mm_cidmask for the remote cleanup can considerably reduce
> the function's runtime in highly isolated environments, where each
> process has affinity to a single core.  In the worst case, the mask
> is equivalent to all possible CPUs and the behaviour is unchanged
> from the current one.
> 
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>   kernel/sched/core.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 95e40895a519..57b50b5952fa 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10553,14 +10553,14 @@ static void task_mm_cid_work(struct callback_head *work)
>   		return;
>   	cidmask = mm_cidmask(mm);
>   	/* Clear cids that were not recently used. */
> -	for_each_possible_cpu(cpu)
> +	for_each_cpu_from(cpu, cidmask)

Hi Gabriele,

Thanks for looking into this. I understand that you are after minimizing the
latency introduced by task_mm_cid_work on isolated cores. I think we'll need
to think a bit harder, because the proposed solution does not work:

  * for_each_cpu_from - iterate over CPUs present in @mask, from @cpu to the end of @mask.

cpu is uninitialized. So this is completely broken. Was this tested
against a workload that actually uses concurrency IDs to ensure it does
not break the whole thing ? Did you run the rseq selftests ?

Also, the mm_cidmask is a mask of concurrency IDs, not a mask of CPUs. So
using it to iterate on CPUs is wrong.

Mathieu

>   		sched_mm_cid_remote_clear_old(mm, cpu);
>   	weight = cpumask_weight(cidmask);
>   	/*
>   	 * Clear cids that are greater or equal to the cidmask weight to
>   	 * recompact it.
>   	 */
> -	for_each_possible_cpu(cpu)
> +	for_each_cpu_from(cpu, cidmask)
>   		sched_mm_cid_remote_clear_weight(mm, cpu, weight);
>   }
>   

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [PATCH 1/2] sched: Optimise task_mm_cid_work duration
Posted by Gabriele Monaco 1 year, 2 months ago
Hi Mathieu,

thanks for the quick reply.

> Thanks for looking into this. I understand that you are after
> minimizing the
> latency introduced by task_mm_cid_work on isolated cores. I think
> we'll need
> to think a bit harder, because the proposed solution does not work:
> 
>   * for_each_cpu_from - iterate over CPUs present in @mask, from @cpu
> to the end of @mask.
> 
> cpu is uninitialized. So this is completely broken. 

My bad, wrong macro. It should be for_each_cpu.

> Was this tested
> against a workload that actually uses concurrency IDs to ensure it
> does
> not break the whole thing ? Did you run the rseq selftests ?
> 

I did run the stress-ng --rseq command for a while and didn't see any
error reported, but it's probably not bulletproof. I'll use the
selftests for the next iterations.

> Also, the mm_cidmask is a mask of concurrency IDs, not a mask of
> CPUs. So
> using it to iterate on CPUs is wrong.
> 

Mmh, I get it. During my tests I was definitely getting better results
than when using the mm_cpus_allowed mask, but I guess that was a
broken test, so it just doesn't count.
Do you think using mm_cpus_allowed would make more sense, with the
/risk/ of being a bit over-cautious?


Gabriele
Re: [PATCH 1/2] sched: Optimise task_mm_cid_work duration
Posted by Mathieu Desnoyers 1 year, 2 months ago
On 2024-12-02 09:56, Gabriele Monaco wrote:
> Hi Mathieu,
> 
> thanks for the quick reply.
> 
>> Thanks for looking into this. I understand that you are after
>> minimizing the
>> latency introduced by task_mm_cid_work on isolated cores. I think
>> we'll need
>> to think a bit harder, because the proposed solution does not work:
>>
>>    * for_each_cpu_from - iterate over CPUs present in @mask, from @cpu
>> to the end of @mask.
>>
>> cpu is uninitialized. So this is completely broken.
> 
> My bad, wrong macro.. Should be for_each_cpu
> 
>> Was this tested
>> against a workload that actually uses concurrency IDs to ensure it
>> does
>> not break the whole thing ? Did you run the rseq selftests ?
>>
> 
> I did run the stress-ng --rseq command for a while and didn't see any
> error reported, but it's probably not bulletproof. I'll use the
> selftests for the next iterations.
> 
>> Also, the mm_cidmask is a mask of concurrency IDs, not a mask of
>> CPUs. So
>> using it to iterate on CPUs is wrong.
>>
> 
> Mmh I get it, during my tests I was definitely getting better results
> than using the mm_cpus_allowed mask, but I guess that was a broken test
> so it just doesn't count..
> Do you think using mm_cpus_allowed would make more sense, with the
> /risk/ of being a bit over-cautious?

mm_cpus_allowed can be updated dynamically by setting cpu affinity
and changing the cpusets. If we change the iteration from all possible
cpus to allowed cpus, then we need to adapt the allowed cpus updates
with the associated updates to the mm_cid as well. This adds
complexity.

I understand that you wish to offload this task_work to a non-isolated
CPU (non-RT). If you do so, do you really care about the duration of
task_mm_cid_work enough to justify the added complexity to the
cpu affinity/cpusets updates ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [PATCH 1/2] sched: Optimise task_mm_cid_work duration
Posted by Gabriele Monaco 1 year, 2 months ago

On Mon, 2024-12-02 at 10:01 -0500, Mathieu Desnoyers wrote:
> 
> mm_cpus_allowed can be updated dynamically by setting cpu affinity
> and changing the cpusets. If we change the iteration from all
> possible cpus to allowed cpus, then we need to adapt the allowed
> cpus updates with the associated updates to the mm_cid as well.
> This adds complexity.
> 
> I understand that you wish to offload this task_work to a
> non-isolated CPU (non-RT). If you do so, do you really care about
> the duration of task_mm_cid_work enough to justify the added
> complexity to the cpu affinity/cpusets updates ?
> 

Well, no, we don't really care at this point. If it isn't a quick win,
I'd say I can simply remove it from the patchset; for the current use
case it doesn't really matter.

Thanks for the analysis.
Gabriele