The current behaviour of task_mm_cid_work is to loop through all
possible CPUs twice to clean up old mm_cids remotely. This can waste
resources, especially for tasks whose CPU affinity is restricted to a
few cores.

This patch reduces the number of CPUs involved in the remote CID
cleanup carried out by task_mm_cid_work.

Using the mm_cidmask for the remote cleanup can considerably reduce the
function's runtime in highly isolated environments, where each process
has affinity to a single core. Conversely, in the worst case the mask
is equivalent to all possible CPUs and we see no difference from the
current behaviour.

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
---
 kernel/sched/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 95e40895a519..57b50b5952fa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10553,14 +10553,14 @@ static void task_mm_cid_work(struct callback_head *work)
 		return;
 	cidmask = mm_cidmask(mm);
 	/* Clear cids that were not recently used. */
-	for_each_possible_cpu(cpu)
+	for_each_cpu_from(cpu, cidmask)
 		sched_mm_cid_remote_clear_old(mm, cpu);
 	weight = cpumask_weight(cidmask);
 	/*
 	 * Clear cids that are greater or equal to the cidmask weight to
 	 * recompact it.
 	 */
-	for_each_possible_cpu(cpu)
+	for_each_cpu_from(cpu, cidmask)
 		sched_mm_cid_remote_clear_weight(mm, cpu, weight);
 }
--
2.47.0

Hi Gabriele,

kernel test robot noticed the following build warnings:
[auto build test WARNING on e70140ba0d2b1a30467d4af6bcfe761327b9ec95]
url: https://github.com/intel-lab-lkp/linux/commits/Gabriele-Monaco/sched-Optimise-task_mm_cid_work-duration/20241203-003033
base: e70140ba0d2b1a30467d4af6bcfe761327b9ec95
patch link: https://lore.kernel.org/r/20241202140735.56368-2-gmonaco%40redhat.com
patch subject: [PATCH 1/2] sched: Optimise task_mm_cid_work duration
config: arm64-randconfig-004-20241203 (https://download.01.org/0day-ci/archive/20241203/202412031029.B7lLh63F-lkp@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 592c0fe55f6d9a811028b5f3507be91458ab2713)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241203/202412031029.B7lLh63F-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412031029.B7lLh63F-lkp@intel.com/
All warnings (new ones prefixed by >>):
In file included from kernel/sched/core.c:10:
In file included from include/linux/highmem.h:8:
In file included from include/linux/cacheflush.h:5:
In file included from arch/arm64/include/asm/cacheflush.h:11:
In file included from include/linux/kgdb.h:19:
In file included from include/linux/kprobes.h:28:
In file included from include/linux/ftrace.h:13:
In file included from include/linux/kallsyms.h:13:
In file included from include/linux/mm.h:2223:
include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
518 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
| ~~~~~~~~~~~ ^ ~~~
>> kernel/sched/core.c:10552:6: warning: variable 'cpu' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
10552 | if (!try_cmpxchg(&mm->mm_cid_next_scan, &old_scan, next_scan))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
kernel/sched/core.c:10556:20: note: uninitialized use occurs here
10556 | for_each_cpu_from(cpu, cidmask)
| ^~~
include/linux/cpumask.h:391:24: note: expanded from macro 'for_each_cpu_from'
391 | for_each_set_bit_from(cpu, cpumask_bits(mask), small_cpumask_bits)
| ^~~
include/linux/find.h:605:48: note: expanded from macro 'for_each_set_bit_from'
605 | for (; (bit) = find_next_bit((addr), (size), (bit)), (bit) < (size); (bit)++)
| ^~~
kernel/sched/core.c:10552:2: note: remove the 'if' if its condition is always true
10552 | if (!try_cmpxchg(&mm->mm_cid_next_scan, &old_scan, next_scan))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
10553 | return;
kernel/sched/core.c:10529:17: note: initialize the variable 'cpu' to silence this warning
10529 | int weight, cpu;
| ^
| = 0
kernel/sched/core.c:1809:1: warning: unused function 'uclamp_update_active' [-Wunused-function]
1809 | uclamp_update_active(struct task_struct *p)
| ^~~~~~~~~~~~~~~~~~~~
kernel/sched/core.c:6430:1: warning: unused function 'class_core_lock_lock_ptr' [-Wunused-function]
6430 | DEFINE_LOCK_GUARD_1(core_lock, int,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6431 | sched_core_lock(*_T->lock, &_T->flags),
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6432 | sched_core_unlock(*_T->lock, &_T->flags),
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6433 | unsigned long flags)
| ~~~~~~~~~~~~~~~~~~~~
include/linux/cleanup.h:416:49: note: expanded from macro 'DEFINE_LOCK_GUARD_1'
416 | __DEFINE_CLASS_IS_CONDITIONAL(_name, false); \
| ^
417 | __DEFINE_UNLOCK_GUARD(_name, _type, _unlock, __VA_ARGS__) \
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/linux/cleanup.h:392:21: note: expanded from macro '\
__DEFINE_UNLOCK_GUARD'
392 | static inline void *class_##_name##_lock_ptr(class_##_name##_t *_T) \
| ^~~~~~~~~~~~~~~~~~~~~~~~
<scratch space>:63:1: note: expanded from here
63 | class_core_lock_lock_ptr
| ^~~~~~~~~~~~~~~~~~~~~~~~
4 warnings generated.
vim +10552 kernel/sched/core.c
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10522
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10523 static void task_mm_cid_work(struct callback_head *work)
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10524 {
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10525 unsigned long now = jiffies, old_scan, next_scan;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10526 struct task_struct *t = current;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10527 struct cpumask *cidmask;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10528 struct mm_struct *mm;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10529 int weight, cpu;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10530
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10531 SCHED_WARN_ON(t != container_of(work, struct task_struct, cid_work));
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10532
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10533 work->next = work; /* Prevent double-add */
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10534 if (t->flags & PF_EXITING)
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10535 return;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10536 mm = t->mm;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10537 if (!mm)
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10538 return;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10539 old_scan = READ_ONCE(mm->mm_cid_next_scan);
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10540 next_scan = now + msecs_to_jiffies(MM_CID_SCAN_DELAY);
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10541 if (!old_scan) {
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10542 unsigned long res;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10543
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10544 res = cmpxchg(&mm->mm_cid_next_scan, old_scan, next_scan);
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10545 if (res != old_scan)
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10546 old_scan = res;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10547 else
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10548 old_scan = next_scan;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10549 }
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10550 if (time_before(now, old_scan))
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10551 return;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 @10552 if (!try_cmpxchg(&mm->mm_cid_next_scan, &old_scan, next_scan))
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10553 return;
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10554 cidmask = mm_cidmask(mm);
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10555 /* Clear cids that were not recently used. */
2909dedb7586e2c Gabriele Monaco 2024-12-02 10556 for_each_cpu_from(cpu, cidmask)
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10557 sched_mm_cid_remote_clear_old(mm, cpu);
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10558 weight = cpumask_weight(cidmask);
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10559 /*
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10560 * Clear cids that are greater or equal to the cidmask weight to
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10561 * recompact it.
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10562 */
2909dedb7586e2c Gabriele Monaco 2024-12-02 10563 for_each_cpu_from(cpu, cidmask)
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10564 sched_mm_cid_remote_clear_weight(mm, cpu, weight);
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10565 }
223baf9d17f25e2 Mathieu Desnoyers 2023-04-20 10566
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On 2024-12-02 09:07, Gabriele Monaco wrote:
> The current behaviour of task_mm_cid_work is to loop through all
> possible CPUs twice to clean up old mm_cids remotely. This can waste
> resources, especially for tasks whose CPU affinity is restricted to a
> few cores.
>
> This patch reduces the number of CPUs involved in the remote CID
> cleanup carried out by task_mm_cid_work.
>
> Using the mm_cidmask for the remote cleanup can considerably reduce the
> function's runtime in highly isolated environments, where each process
> has affinity to a single core. Conversely, in the worst case the mask
> is equivalent to all possible CPUs and we see no difference from the
> current behaviour.
>
> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
> ---
>  kernel/sched/core.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 95e40895a519..57b50b5952fa 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -10553,14 +10553,14 @@ static void task_mm_cid_work(struct callback_head *work)
>  		return;
>  	cidmask = mm_cidmask(mm);
>  	/* Clear cids that were not recently used. */
> -	for_each_possible_cpu(cpu)
> +	for_each_cpu_from(cpu, cidmask)

Hi Gabriele,

Thanks for looking into this. I understand that you are after
minimizing the latency introduced by task_mm_cid_work on isolated
cores. I think we'll need to think a bit harder, because the proposed
solution does not work:

 * for_each_cpu_from - iterate over CPUs present in @mask, from @cpu to
   the end of @mask.

cpu is uninitialized. So this is completely broken. Was this tested
against a workload that actually uses concurrency IDs to ensure it does
not break the whole thing ? Did you run the rseq selftests ?

Also, the mm_cidmask is a mask of concurrency IDs, not a mask of CPUs.
So using it to iterate on CPUs is wrong.

Mathieu

>  		sched_mm_cid_remote_clear_old(mm, cpu);
>  	weight = cpumask_weight(cidmask);
>  	/*
>  	 * Clear cids that are greater or equal to the cidmask weight to
>  	 * recompact it.
>  	 */
> -	for_each_possible_cpu(cpu)
> +	for_each_cpu_from(cpu, cidmask)
>  		sched_mm_cid_remote_clear_weight(mm, cpu, weight);
>  }

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
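For context, the two iterators differ in whether the macro initializes
the iteration variable. A minimal sketch of the semantics (handle() and
mask are placeholders for illustration, not kernel APIs):

	unsigned int cpu;

	/*
	 * for_each_cpu() always scans @mask from the beginning; the
	 * iterator needs no prior initialization.
	 */
	for_each_cpu(cpu, mask)
		handle(cpu);

	/*
	 * for_each_cpu_from() resumes from the current value of @cpu,
	 * so the caller must initialize it first. The patch above never
	 * does, which is what triggers -Wsometimes-uninitialized.
	 */
	cpu = 0;
	for_each_cpu_from(cpu, mask)
		handle(cpu);
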
Hi Mathieu,

thanks for the quick reply.

> Thanks for looking into this. I understand that you are after
> minimizing the latency introduced by task_mm_cid_work on isolated
> cores. I think we'll need to think a bit harder, because the proposed
> solution does not work:
>
>  * for_each_cpu_from - iterate over CPUs present in @mask, from @cpu
>    to the end of @mask.
>
> cpu is uninitialized. So this is completely broken.

My bad, wrong macro.. Should be for_each_cpu

> Was this tested against a workload that actually uses concurrency IDs
> to ensure it does not break the whole thing ? Did you run the rseq
> selftests ?

I did run the stress-ng --rseq command for a while and didn't see any
error reported, but it's probably not bulletproof. I'll use the
selftests for the next iterations.

> Also, the mm_cidmask is a mask of concurrency IDs, not a mask of
> CPUs. So using it to iterate on CPUs is wrong.

Mmh I get it, during my tests I was definitely getting better results
than using the mm_cpus_allowed mask, but I guess that was a broken test
so it just doesn't count..
Do you think using mm_cpus_allowed would make more sense, with the
/risk/ of being a bit over-cautious?

Gabriele
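A minimal sketch of the direction floated here, assuming a helper such
as mm_cpus_allowed(mm) that returns the struct cpumask * of CPUs the mm
may run on (whether such a mask can be used safely is exactly what the
next reply examines):

	int cpu;

	/*
	 * Sketch: scan only the CPUs the mm may run on rather than
	 * every possible CPU. This is only correct if the mask is
	 * kept in sync with affinity/cpuset changes.
	 */
	for_each_cpu(cpu, mm_cpus_allowed(mm))
		sched_mm_cid_remote_clear_old(mm, cpu);
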
On 2024-12-02 09:56, Gabriele Monaco wrote:
> Hi Mathieu,
>
> thanks for the quick reply.
>
>> Thanks for looking into this. I understand that you are after
>> minimizing the latency introduced by task_mm_cid_work on isolated
>> cores. I think we'll need to think a bit harder, because the
>> proposed solution does not work:
>>
>>  * for_each_cpu_from - iterate over CPUs present in @mask, from @cpu
>>    to the end of @mask.
>>
>> cpu is uninitialized. So this is completely broken.
>
> My bad, wrong macro.. Should be for_each_cpu
>
>> Was this tested against a workload that actually uses concurrency
>> IDs to ensure it does not break the whole thing ? Did you run the
>> rseq selftests ?
>
> I did run the stress-ng --rseq command for a while and didn't see any
> error reported, but it's probably not bulletproof. I'll use the
> selftests for the next iterations.
>
>> Also, the mm_cidmask is a mask of concurrency IDs, not a mask of
>> CPUs. So using it to iterate on CPUs is wrong.
>
> Mmh I get it, during my tests I was definitely getting better results
> than using the mm_cpus_allowed mask, but I guess that was a broken
> test so it just doesn't count..
> Do you think using mm_cpus_allowed would make more sense, with the
> /risk/ of being a bit over-cautious?

mm_cpus_allowed can be updated dynamically by setting cpu affinity
and changing the cpusets. If we change the iteration from each possible
cpus to allowed cpus, then we need to adapt the allowed cpus updates
with the associated updates to the mm_cid as well. This is adding
complexity.

I understand that you wish to offload this task_work to a non-isolated
CPU (non-RT). If you do so, do you really care about the duration of
task_mm_cid_work enough to justify the added complexity to the
cpu affinity/cpusets updates ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Mon, 2024-12-02 at 10:01 -0500, Mathieu Desnoyers wrote:
>
> mm_cpus_allowed can be updated dynamically by setting cpu affinity
> and changing the cpusets. If we change the iteration from each
> possible cpus to allowed cpus, then we need to adapt the allowed cpus
> updates with the associated updates to the mm_cid as well. This is
> adding complexity.
>
> I understand that you wish to offload this task_work to a
> non-isolated CPU (non-RT). If you do so, do you really care about the
> duration of task_mm_cid_work enough to justify the added complexity
> to the cpu affinity/cpusets updates ?

Well, no we don't really care at this point.. If it isn't a quick win
I'd say I can simply remove it from the patchset, for the current use
case it doesn't really matter.

Thanks for the analysis.

Gabriele