commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
introduced a per-mm/cpu current concurrency id (mm_cid), which keeps
a reference to the concurrency id allocated for each CPU. This reference
expires shortly after a 100ms delay.
These per-CPU references keep the per-mm-cid data cache-local in
situations where threads are running at least once on each CPU within
each 100ms window, thus keeping the per-cpu reference alive.
However, intermittent workloads behaving in bursts spaced by more than
100ms on each CPU exhibit bad cache locality and degraded performance
compared to purely per-cpu data indexing, because concurrency IDs are
allocated over various CPUs and cores, therefore losing cache locality
of the associated data.
Introduce the following changes to improve per-mm-cid cache locality:
- Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep
track of which mm_cid value was last used, and use it as a hint to
attempt re-allocating the same concurrency ID the next time this
mm/cpu needs to allocate a concurrency ID,
- Add a per-mm CPUs allowed mask, which keeps track of the union of
CPUs allowed for all threads belonging to this mm. This cpumask is
only set during the lifetime of the mm, never cleared, so it
represents the union of all the CPUs allowed since the beginning of
the mm lifetime. (note that the mm_cpumask() is really arch-specific
and tailored to the TLB flush needs, and is thus _not_ a viable
approach for this)
- Add a per-mm nr_cpus_allowed to keep track of the weight of the
per-mm CPUs allowed mask (for fast access),
- Add a per-mm nr_cids_used to keep track of the highest concurrency
ID allocated for the mm. This is used for expanding the concurrency ID
allocation within the upper bound defined by:
min(mm->nr_cpus_allowed, mm->mm_users)
When the next unused CID value reaches this threshold, stop trying
to expand the cid allocation and use the first available cid value
instead.
Spreading allocation to use all the cid values within the range
[ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]
improves cache locality while preserving mm_cid compactness within the
expected user limits.
- In __mm_cid_try_get, only return cid values within the range
[ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This
prevents allocating cids above the number of allowed cpus in
rare scenarios where cid allocation races with a concurrent
remote-clear of the per-mm/cpu cid. This improvement is made
possible by the addition of the per-mm CPUs allowed mask.
- In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than
t->nr_cpus_allowed. This criterion was really meant to compare
the number of mm->mm_users to the number of CPUs allowed for the
entire mm. Therefore, the prior comparison worked fine when all
threads shared the same CPUs allowed mask, but not so much in
scenarios where those threads have different masks (e.g. each
thread pinned to a single CPU). This improvement is made
possible by the addition of the per-mm CPUs allowed mask.
* Benchmarks
Each thread increments 16kB worth of 8-bit integers in bursts, with
a configurable delay between each thread's execution. Each thread runs
one after the other (no two threads run concurrently). The order of
thread execution in the sequence is random. The thread execution
sequence begins again after all threads have executed. The 16kB areas
are allocated with rseq_mempool and indexed by either cpu_id, mm_cid
(not cache-local), or cache-local mm_cid. Each thread is pinned to its
own core.
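
For illustration, a minimal sketch of the per-thread burst described
above (not the actual benchmark source; the helper name
rseq_current_mm_cid() and the pool layout are assumptions, and the
mm_cid can instead be read from the mm_cid field of struct rseq):

	#include <stddef.h>
	#include <stdint.h>

	#define AREA_LEN	16384	/* 16kB of 8-bit integers per thread */

	/* Assumed helper: returns the current thread's concurrency ID. */
	extern int rseq_current_mm_cid(void);

	static void burst(uint8_t *pool_base)
	{
		/* Index this thread's 16kB area by concurrency ID. */
		uint8_t *area = pool_base +
				(size_t)rseq_current_mm_cid() * AREA_LEN;
		size_t i;

		for (i = 0; i < AREA_LEN; i++)
			area[i]++;
	}

Indexing by cpu_id or by a non-cache-local mm_cid only changes which
index selects the area; the increment loop is identical.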
Testing configurations:
8-core/1-L3: Use 8 cores within a single L3
24-core/24-L3: Use 24 cores, 1 core per L3
192-core/24-L3: Use 192 cores (all cores in the system)
384-thread/24-L3: Use 384 HW threads (all HW threads in the system)
Intermittent workload delays between threads: 200ms, 10ms.
Hardware:
CPU(s):               384
On-line CPU(s) list:  0-383
Vendor ID:            AuthenticAMD
Model name:           AMD EPYC 9654 96-Core Processor
Thread(s) per core:   2
Core(s) per socket:   96
Socket(s):            2
Caches (sum of all):
  L1d:                6 MiB (192 instances)
  L1i:                6 MiB (192 instances)
  L2:                 192 MiB (192 instances)
  L3:                 768 MiB (24 instances)
Each result is an average of 5 test runs. The cache-local speedup
is calculated as: (mm_cid) / (cache-local mm_cid), e.g.
19289 / 1336 = 14.4x for the 8-core/1-L3 case at 200ms delay.
Intermittent workload delay: 200ms

                  per-cpu   mm_cid   cache-local mm_cid   cache-local speedup
                    (ns)     (ns)          (ns)
8-core/1-L3         1374    19289         1336                 14.4x
24-core/24-L3       2423    26721         1594                 16.7x
192-core/24-L3      2291    15826         2153                  7.3x
384-thread/24-L3    1874    13234         1907                  6.9x

Intermittent workload delay: 10ms

                  per-cpu   mm_cid   cache-local mm_cid   cache-local speedup
                    (ns)     (ns)          (ns)
8-core/1-L3          662      756          686                  1.1x
24-core/24-L3       1378     3648         1035                  3.5x
192-core/24-L3      1439    10833         1482                  7.3x
384-thread/24-L3    1503    10570         1556                  6.8x
[ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency IDs"
patch series with a simpler and more general approach. ]
Link: https://lore.kernel.org/lkml/20240823185946.418340-1-mathieu.desnoyers@efficios.com/
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
---
fs/exec.c | 2 +-
include/linux/mm_types.h | 66 ++++++++++++++++++++++++++++++++++------
kernel/fork.c | 2 +-
kernel/sched/core.c | 7 +++--
kernel/sched/sched.h | 47 +++++++++++++++++++---------
5 files changed, 97 insertions(+), 27 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 0c17e59e3767..7e73b0fc1305 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1039,7 +1039,7 @@ static int exec_mmap(struct mm_struct *mm)
active_mm = tsk->active_mm;
tsk->active_mm = mm;
tsk->mm = mm;
- mm_init_cid(mm);
+ mm_init_cid(mm, tsk);
/*
* This prevents preemption while active_mm is being loaded and
* it and mm are being updated, which could cause problems for
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index af3a0256fa93..7d63d27862e4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -755,6 +755,7 @@ struct vm_area_struct {
struct mm_cid {
u64 time;
int cid;
+ int recent_cid;
};
#endif
@@ -825,6 +826,20 @@ struct mm_struct {
* When the next mm_cid scan is due (in jiffies).
*/
unsigned long mm_cid_next_scan;
+ /**
+ * @nr_cpus_allowed: Number of CPUs allowed for mm.
+ *
+ * Number of CPUs allowed in the union of all mm's
+ * threads allowed CPUs.
+ */
+ atomic_t nr_cpus_allowed;
+ /**
+ * @nr_cids_used: Number of used concurrency IDs.
+ *
+ * Track the highest concurrency ID allocated for the
+ * mm: nr_cids_used - 1.
+ */
+ atomic_t nr_cids_used;
#endif
#ifdef CONFIG_MMU
atomic_long_t pgtables_bytes; /* size of all page tables */
@@ -1143,18 +1158,30 @@ static inline int mm_cid_clear_lazy_put(int cid)
return cid & ~MM_CID_LAZY_PUT;
}
+/*
+ * mm_cpus_allowed: Union of all mm's threads allowed CPUs.
+ */
+static inline cpumask_t *mm_cpus_allowed(struct mm_struct *mm)
+{
+ unsigned long bitmap = (unsigned long)mm;
+
+ bitmap += offsetof(struct mm_struct, cpu_bitmap);
+ /* Skip cpu_bitmap */
+ bitmap += cpumask_size();
+ return (struct cpumask *)bitmap;
+}
+
/* Accessor for struct mm_struct's cidmask. */
static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
{
- unsigned long cid_bitmap = (unsigned long)mm;
+ unsigned long cid_bitmap = (unsigned long)mm_cpus_allowed(mm);
- cid_bitmap += offsetof(struct mm_struct, cpu_bitmap);
- /* Skip cpu_bitmap */
+ /* Skip mm_cpus_allowed */
cid_bitmap += cpumask_size();
return (struct cpumask *)cid_bitmap;
}
-static inline void mm_init_cid(struct mm_struct *mm)
+static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
{
int i;
@@ -1162,17 +1189,21 @@ static inline void mm_init_cid(struct mm_struct *mm)
struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);
pcpu_cid->cid = MM_CID_UNSET;
+ pcpu_cid->recent_cid = MM_CID_UNSET;
pcpu_cid->time = 0;
}
+ atomic_set(&mm->nr_cpus_allowed, p->nr_cpus_allowed);
+ atomic_set(&mm->nr_cids_used, 0);
+ cpumask_copy(mm_cpus_allowed(mm), p->cpus_ptr);
cpumask_clear(mm_cidmask(mm));
}
-static inline int mm_alloc_cid_noprof(struct mm_struct *mm)
+static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct task_struct *p)
{
mm->pcpu_cid = alloc_percpu_noprof(struct mm_cid);
if (!mm->pcpu_cid)
return -ENOMEM;
- mm_init_cid(mm);
+ mm_init_cid(mm, p);
return 0;
}
#define mm_alloc_cid(...) alloc_hooks(mm_alloc_cid_noprof(__VA_ARGS__))
@@ -1185,16 +1216,33 @@ static inline void mm_destroy_cid(struct mm_struct *mm)
static inline unsigned int mm_cid_size(void)
{
- return cpumask_size();
+ return 2 * cpumask_size(); /* mm_cpus_allowed(), mm_cidmask(). */
+}
+
+static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask)
+{
+ struct cpumask *mm_allowed = mm_cpus_allowed(mm);
+ int cpu, nr_set = 0;
+
+ if (!mm)
+ return;
+ /* The mm_cpus_allowed is the union of each thread allowed CPUs masks. */
+ for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next_andnot(cpu, cpumask, mm_allowed)) {
+ if (!cpumask_test_and_set_cpu(cpu, mm_allowed))
+ nr_set++;
+ }
+ atomic_add(nr_set, &mm->nr_cpus_allowed);
}
#else /* CONFIG_SCHED_MM_CID */
-static inline void mm_init_cid(struct mm_struct *mm) { }
-static inline int mm_alloc_cid(struct mm_struct *mm) { return 0; }
+static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p) { }
+static inline int mm_alloc_cid(struct mm_struct *mm, struct task_struct *p) { return 0; }
static inline void mm_destroy_cid(struct mm_struct *mm) { }
+
static inline unsigned int mm_cid_size(void)
{
return 0;
}
+static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask) { }
#endif /* CONFIG_SCHED_MM_CID */
struct mmu_gather;
diff --git a/kernel/fork.c b/kernel/fork.c
index 99076dbe27d8..b44f545ad82c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1298,7 +1298,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
if (init_new_context(p, mm))
goto fail_nocontext;
- if (mm_alloc_cid(mm))
+ if (mm_alloc_cid(mm, p))
goto fail_cid;
if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3e84a3b7b7bb..3243e9abfefb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2784,6 +2784,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
put_prev_task(rq, p);
p->sched_class->set_cpus_allowed(p, ctx);
+ mm_set_cpus_allowed(p->mm, ctx->new_mask);
if (queued)
enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
@@ -11784,6 +11785,7 @@ int __sched_mm_cid_migrate_from_try_steal_cid(struct rq *src_rq,
*/
if (!try_cmpxchg(&src_pcpu_cid->cid, &lazy_cid, MM_CID_UNSET))
return -1;
+ WRITE_ONCE(src_pcpu_cid->recent_cid, MM_CID_UNSET);
return src_cid;
}
@@ -11825,7 +11827,7 @@ void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
dst_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(dst_rq));
dst_cid = READ_ONCE(dst_pcpu_cid->cid);
if (!mm_cid_is_unset(dst_cid) &&
- atomic_read(&mm->mm_users) >= t->nr_cpus_allowed)
+ atomic_read(&mm->mm_users) >= atomic_read(&mm->nr_cpus_allowed))
return;
src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, src_cpu);
src_rq = cpu_rq(src_cpu);
@@ -11843,6 +11845,7 @@ void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
/* Move src_cid to dst cpu. */
mm_cid_snapshot_time(dst_rq, mm);
WRITE_ONCE(dst_pcpu_cid->cid, src_cid);
+ WRITE_ONCE(dst_pcpu_cid->recent_cid, src_cid);
}
static void sched_mm_cid_remote_clear(struct mm_struct *mm, struct mm_cid *pcpu_cid,
@@ -12079,7 +12082,7 @@ void sched_mm_cid_after_execve(struct task_struct *t)
* Matches barrier in sched_mm_cid_remote_clear_old().
*/
smp_mb();
- t->last_mm_cid = t->mm_cid = mm_cid_get(rq, mm);
+ t->last_mm_cid = t->mm_cid = mm_cid_get(rq, t, mm);
}
rseq_set_notify_resume(t);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 38aeedd8a6cc..4d11dbd5847b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3311,24 +3311,40 @@ static inline void mm_cid_put(struct mm_struct *mm)
__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
}
-static inline int __mm_cid_try_get(struct mm_struct *mm)
+static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
{
- struct cpumask *cpumask;
- int cid;
+ struct cpumask *cidmask = mm_cidmask(mm);
+ struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
+ int cid = __this_cpu_read(pcpu_cid->recent_cid);
- cpumask = mm_cidmask(mm);
+ /* Try to re-use recent cid. This improves cache locality. */
+ if (!mm_cid_is_unset(cid) && !cpumask_test_and_set_cpu(cid, cidmask))
+ return cid;
+ /*
+ * Expand cid allocation if used cids are below the number cpus
+ * allowed and number of threads. Expanding cid allocation as
+ * much as possible improves cache locality.
+ */
+ cid = atomic_read(&mm->nr_cids_used);
+ while (cid < atomic_read(&mm->nr_cpus_allowed) && cid < atomic_read(&mm->mm_users)) {
+ if (!atomic_try_cmpxchg(&mm->nr_cids_used, &cid, cid + 1))
+ continue;
+ if (!cpumask_test_and_set_cpu(cid, cidmask))
+ return cid;
+ }
/*
+ * Find the first available concurrency id.
* Retry finding first zero bit if the mask is temporarily
* filled. This only happens during concurrent remote-clear
* which owns a cid without holding a rq lock.
*/
for (;;) {
- cid = cpumask_first_zero(cpumask);
- if (cid < nr_cpu_ids)
+ cid = cpumask_first_zero(cidmask);
+ if (cid < atomic_read(&mm->nr_cpus_allowed))
break;
cpu_relax();
}
- if (cpumask_test_and_set_cpu(cid, cpumask))
+ if (cpumask_test_and_set_cpu(cid, cidmask))
return -1;
return cid;
}
@@ -3345,7 +3361,8 @@ static inline void mm_cid_snapshot_time(struct rq *rq, struct mm_struct *mm)
WRITE_ONCE(pcpu_cid->time, rq->clock);
}
-static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
+static inline int __mm_cid_get(struct rq *rq, struct task_struct *t,
+ struct mm_struct *mm)
{
int cid;
@@ -3355,13 +3372,13 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
* guarantee forward progress.
*/
if (!READ_ONCE(use_cid_lock)) {
- cid = __mm_cid_try_get(mm);
+ cid = __mm_cid_try_get(t, mm);
if (cid >= 0)
goto end;
raw_spin_lock(&cid_lock);
} else {
raw_spin_lock(&cid_lock);
- cid = __mm_cid_try_get(mm);
+ cid = __mm_cid_try_get(t, mm);
if (cid >= 0)
goto unlock;
}
@@ -3381,7 +3398,7 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
* all newcoming allocations observe the use_cid_lock flag set.
*/
do {
- cid = __mm_cid_try_get(mm);
+ cid = __mm_cid_try_get(t, mm);
cpu_relax();
} while (cid < 0);
/*
@@ -3397,7 +3414,8 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
return cid;
}
-static inline int mm_cid_get(struct rq *rq, struct mm_struct *mm)
+static inline int mm_cid_get(struct rq *rq, struct task_struct *t,
+ struct mm_struct *mm)
{
struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
struct cpumask *cpumask;
@@ -3414,8 +3432,9 @@ static inline int mm_cid_get(struct rq *rq, struct mm_struct *mm)
if (try_cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, &cid, MM_CID_UNSET))
__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
}
- cid = __mm_cid_get(rq, mm);
+ cid = __mm_cid_get(rq, t, mm);
__this_cpu_write(pcpu_cid->cid, cid);
+ __this_cpu_write(pcpu_cid->recent_cid, cid);
return cid;
}
@@ -3467,7 +3486,7 @@ static inline void switch_mm_cid(struct rq *rq,
prev->mm_cid = -1;
}
if (next->mm_cid_active)
- next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next->mm);
+ next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next, next->mm);
}
#else
--
2.39.2
On Tue, Sep 03, 2024 at 03:06:50PM -0400, Mathieu Desnoyers wrote:
[...]
> @@ -1185,16 +1216,33 @@ static inline void mm_destroy_cid(struct mm_struct *mm)
>
> static inline unsigned int mm_cid_size(void)
> {
> - return cpumask_size();
> + return 2 * cpumask_size(); /* mm_cpus_allowed(), mm_cidmask(). */
> +}
> +
> +static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask)
> +{
> + struct cpumask *mm_allowed = mm_cpus_allowed(mm);
> + int cpu, nr_set = 0;
> +
> + if (!mm)
> + return;
> + /* The mm_cpus_allowed is the union of each thread allowed CPUs masks. */
> + for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next_andnot(cpu, cpumask, mm_allowed)) {
> + if (!cpumask_test_and_set_cpu(cpu, mm_allowed))
> + nr_set++;
> + }
You can do the same nicer:

	for_each_cpu(cpu, cpumask)
		nr_set += !cpumask_test_and_set_cpu(cpu, mm_allowed);

This should be faster and a bit simpler, to me.

> +	atomic_add(nr_set, &mm->nr_cpus_allowed);
> }

What concerns me is that you call atomic function in a loop, which makes
the whole procedure non-atomic. If it's OK, can you put a comment why a
series of atomic ops is OK here? If not - I believe external locking
would be needed.

Thanks,
Yury
On 2024-09-03 15:59, Yury Norov wrote:
> On Tue, Sep 03, 2024 at 03:06:50PM -0400, Mathieu Desnoyers wrote:
[...]
>> +
>> +static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask)
>> +{
>> + struct cpumask *mm_allowed = mm_cpus_allowed(mm);
>> + int cpu, nr_set = 0;
>> +
>> + if (!mm)
>> + return;
>> + /* The mm_cpus_allowed is the union of each thread allowed CPUs masks. */
>> + for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next_andnot(cpu, cpumask, mm_allowed)) {
>> + if (!cpumask_test_and_set_cpu(cpu, mm_allowed))
>> + nr_set++;
>> + }
>
> You can do the same nicer:
>
> for_each_cpu(cpu, cpumask)
> nr_set += !cpumask_test_and_set_cpu(cpu, mm_allowed);
>
> This should be faster and a bit simpler, to me.
In this scenario, I expect the per-thread cpumasks of a given process
to typically have the same bits set across all threads belonging to
that process. There are of course scenarios where specific threads
will override the mask, but I don't expect this to be the most
frequent case.
So we typically have an operation which initially copies the initial
thread's allowed cpus mask to the mm allowed cpus mask, and then when
additional affinity changes are done, we want to augment the mm allowed
cpus masks with any additional cpu that may show up. But again, I expect
the initial thread to typically have the complete mask and other
operations won't typically change the mm allowed cpumask bits.
I also expect the cpumask to be often quite dense (often all bits
are set).
Now if we look at the operations for your proposal here:
- for_each_cpu loads cpumask word-by-word and for each set bit, it
issues cpumask_test_and_set_cpu on mm_allowed, which is really a
test_and_set_bit, a fully ordered atomic operation, on each _bit_
set. That's O(nr_cpus) fully ordered atomic operations, and thus
expensive exclusive cache line accesses.
My approach does:
- The equivalent of a for_each_cpu_andnot (actually I should use
exactly that! I just noticed it exists in the API.), which loads
both thread and mm CPUs allowed masks in parallel, word-by-word,
and only issues a cpumask_test_and_set_cpu for CPUs which are set
in the per-thread mask, but not in the mm mask. In the typical cases
discussed above, we pretty much never need to issue the atomic
test-and-set. So all we need to do for the common case is to read
both cpu masks in parallel, no stores/atomic ops needed.
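
As a sketch, the for_each_cpu_andnot() variant would look like this
(same surrounding context as the patch assumed):

	static inline void mm_set_cpus_allowed(struct mm_struct *mm,
					       const struct cpumask *cpumask)
	{
		struct cpumask *mm_allowed;
		int cpu, nr_set = 0;

		if (!mm)
			return;
		mm_allowed = mm_cpus_allowed(mm);
		/* Visit only CPUs in @cpumask not yet in the mm mask. */
		for_each_cpu_andnot(cpu, cpumask, mm_allowed) {
			if (!cpumask_test_and_set_cpu(cpu, mm_allowed))
				nr_set++;
		}
		if (nr_set)
			atomic_add(nr_set, &mm->nr_cpus_allowed);
	}

When the masks already match, the iteration finds no bits and issues
no atomic operation at all.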
> What concerns me is that you call atomic function in a loop, which makes
> the whole procedure non-atomic. If it's OK, can you put a comment why a
> series of atomic ops is OK here? If not - I believe external locking
> would be needed.
Good point. The mm CPUs allowed mask tracks the allowed set, and based
on the result of each test_and_set it accumulates the number of newly
set bits and atomically adds the total to mm->nr_cpus_allowed. The
mm_cid algorithms only care about mm->nr_cpus_allowed, so they never
need to look at the mm CPUs allowed mask itself; therefore there is no
need to provide ordering guarantees across the two data structures.
If we'd have a cpumask_test_and_set_cpu_relaxed() it would be sufficient
here.
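
No such helper exists today; a rough sketch of what it could look
like, built on atomic_long_fetch_or_relaxed() (hypothetical, for
illustration only):

	/* Hypothetical: atomic test-and-set without the full barrier. */
	static inline bool cpumask_test_and_set_cpu_relaxed(int cpu,
						struct cpumask *mask)
	{
		unsigned long *word = cpumask_bits(mask) + BIT_WORD(cpu);
		unsigned long bit = BIT_MASK(cpu);

		return atomic_long_fetch_or_relaxed(bit,
				(atomic_long_t *)word) & bit;
	}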
As you say, I should explain this in a comment.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On Tue, Sep 03, 2024 at 07:22:37PM -0400, Mathieu Desnoyers wrote:
> On 2024-09-03 15:59, Yury Norov wrote:
> > On Tue, Sep 03, 2024 at 03:06:50PM -0400, Mathieu Desnoyers wrote:
> [...]
> > > +
> > > +static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask)
> > > +{
> > > + struct cpumask *mm_allowed = mm_cpus_allowed(mm);
> > > + int cpu, nr_set = 0;
> > > +
> > > + if (!mm)
> > > + return;
> > > + /* The mm_cpus_allowed is the union of each thread allowed CPUs masks. */
> > > + for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next_andnot(cpu, cpumask, mm_allowed)) {
> > > + if (!cpumask_test_and_set_cpu(cpu, mm_allowed))
> > > + nr_set++;
> > > + }
> >
> > You can do the same nicer:
> >
> > for_each_cpu(cpu, cpumask)
> > nr_set += !cpumask_test_and_set_cpu(cpu, mm_allowed);
> >
> > This should be faster and a bit simpler, to me.
>
> In this scenario, I expect the per-thread cpumasks of a given process
> to typically have the same bits set across all threads belonging to
> that process. There are of course scenarios where specific threads
> will override the mask, but I don't expect this to be the most
> frequent case.
>
> So we typically have an operation which initially copies the initial
> thread's allowed cpus mask to the mm allowed cpus mask, and then when
> additional affinity changes are done, we want to augment the mm allowed
> cpus masks with any additional cpu that may show up. But again, I expect
> the initial thread to typically have the complete mask and other
> operations won't typically change the mm allowed cpumask bits.
>
> I also expect the cpumask to be often quite dense (often all bits
> are set).
>
> Now if we look at the operations for your proposal here:
>
> - for_each_cpu loads cpumask word-by-word and for each set bit, it
> issues cpumask_test_and_set_cpu on mm_allowed, which is really a
> test_and_set_bit, a fully ordered atomic operation, on each _bit_
> set. That's O(nr_cpus) fully ordered atomic operations, and thus
> expensive exclusive cache line accesses.
Both versions are O(N).
> My approach does:
>
> - The equivalent of a for_each_cpu_andnot (actually I should use
> exactly that! I just noticed it exists in the API.), which loads
Yes, you should.
> both thread and mm CPUs allowed masks in parallel, word-by-word,
> and only issues a cpumask_test_and_set_cpu for CPUs which are set
> in the per-thread mask, but not in the mm mask. In the typical cases
> discussed above, we pretty much never need to issue the atomic
> test-and-set. So all we need to do for the common case is to read
> both cpu masks in parallel, no stores/atomic ops needed.
This all doesn't look like a hot path. And anyways, speculating around
performance without numbers on hands sounds cheap.
In my experience, iterators with a very lightweight payload are ~100
times slower comparing to dedicated bitmap ops. Check this for example:
3cea8d4753277.
If you're really cared about performance here, I'd suggest you to
compare your iterators approach with something like this:
cpumask_or(mm_allowed, mm_allowed, cpumask);
atomic_set(&mm->nr_cpus_allowed, cpumask_weight(mm_allowed));
On 2024-09-04 11:24, Yury Norov wrote:
> On Tue, Sep 03, 2024 at 07:22:37PM -0400, Mathieu Desnoyers wrote:
>> On 2024-09-03 15:59, Yury Norov wrote:
>>> On Tue, Sep 03, 2024 at 03:06:50PM -0400, Mathieu Desnoyers wrote:
>> [...]
>>>> +
>>>> +static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask)
>>>> +{
>>>> + struct cpumask *mm_allowed = mm_cpus_allowed(mm);
>>>> + int cpu, nr_set = 0;
>>>> +
>>>> + if (!mm)
>>>> + return;
>>>> + /* The mm_cpus_allowed is the union of each thread allowed CPUs masks. */
>>>> + for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next_andnot(cpu, cpumask, mm_allowed)) {
>>>> + if (!cpumask_test_and_set_cpu(cpu, mm_allowed))
>>>> + nr_set++;
>>>> + }
>>>
>>> You can do the same nicer:
>>>
>>> for_each_cpu(cpu, cpumask)
>>> nr_set += !cpumask_test_and_set_cpu(cpu, mm_allowed);
>>>
>>> This should be faster and a bit simpler, to me.
>>
>> In this scenario, I expect the per-thread cpumasks of a given process
>> to typically have the same bits set across all threads belonging to
>> that process. There are of course scenarios where specific threads
>> will override the mask, but I don't expect this to be the most
>> frequent case.
>>
>> So we typically have an operation which initially copies the initial
>> thread's allowed cpus mask to the mm allowed cpus mask, and then when
>> additional affinity changes are done, we want to augment the mm allowed
>> cpus masks with any additional cpu that may show up. But again, I expect
>> the initial thread to typically have the complete mask and other
>> operations won't typically change the mm allowed cpumask bits.
>>
>> I also expect the cpumask to be often quite dense (often all bits
>> are set).
>>
>> Now if we look at the operations for your proposal here:
>>
>> - for_each_cpu loads cpumask word-by-word and for each set bit, it
>> issues cpumask_test_and_set_cpu on mm_allowed, which is really a
>> test_and_set_bit, a fully ordered atomic operation, on each _bit_
>> set. That's O(nr_cpus) fully ordered atomic operations, and thus
>> expensive exclusive cache line accesses.
>
> Both versions are O(N).
Yes, those are both theoretically O(N), but the cost of
loading/comparing two words compared to the cost of an atomic
test-and-set of each bit within those words is far from being
the same.
>
>> My approach does:
>>
>> - The equivalent of a for_each_cpu_andnot (actually I should use
>> exactly that! I just noticed it exists in the API.), which loads
>
> Yes, you should.
>
>> both thread and mm CPUs allowed masks in parallel, word-by-word,
>> and only issues a cpumask_test_and_set_cpu for CPUs which are set
>> in the per-thread mask, but not in the mm mask. In the typical cases
>> discussed above, we pretty much never need to issue the atomic
>> test-and-set. So all we need to do for the common case is to read
>> both cpu masks in parallel, no stores/atomic ops needed.
>
> This all doesn't look like a hot path. And anyways, speculating around
> performance without numbers on hands sounds cheap.
This is done whenever userspace invokes sched_setaffinity, or changes
its cgroup cpuset. It may not be the most important fast-path in the
world, but I expect some workloads to issue sched_setaffinity whenever
they create a thread, so it's not a purely slow-path either.
> In my experience, iterators with a very lightweight payload are ~100
> times slower comparing to dedicated bitmap ops. Check this for example:
> 3cea8d4753277.
>
> If you're really cared about performance here, I'd suggest you to
> compare your iterators approach with something like this:
>
> cpumask_or(mm_allowed, mm_allowed, cpumask);
> atomic_set(&mm->nr_cpus_allowed, cpumask_weight(mm_allowed));
AFAIU this approach would not work as-is: we'd need some kind of
locking to make sure two concurrent cpumask updates don't each do an
atomic_set that moves the nr_cpus_allowed value backwards to a smaller
value due to ordering.

Also, AFAIU cpumask_or is not safe against concurrent updates, so
we'd need locking for that as well.
I'm fine providing performance numbers, but we need to make sure
the alternative we compare against actually works.
A "dedicated" bitmap op that would do what I need would do the
following:
- For each bit set in bitmap A, which are not set in bitmap B,
atomically test-and-set those bits in bitmap B, and return the
number of bits that transitioned from 0 to 1 for weighting
purposes.
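
A rough sketch of such an op (illustrative only; this is not an
existing bitmap API):

	static inline unsigned int bitmap_atomic_or_count(unsigned long *dst,
					const unsigned long *src,
					unsigned int nbits)
	{
		unsigned int k, nr_set = 0;

		for (k = 0; k < BITS_TO_LONGS(nbits); k++) {
			unsigned long want = src[k] & ~READ_ONCE(dst[k]);
			unsigned long old;

			if (!want)
				continue;
			old = atomic_long_fetch_or(want,
					(atomic_long_t *)&dst[k]);
			/* Count only bits we actually flipped 0 -> 1. */
			nr_set += hweight_long(want & ~old);
		}
		return nr_set;
	}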
In my original attempts I tried using cpumask_weight after altering
the bitmap, but then noticed that it would not work without additional
locking.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On 2024-09-04 11:50, Mathieu Desnoyers wrote:
> On 2024-09-04 11:24, Yury Norov wrote:
[...]
>>
>> This all doesn't look like a hot path. And anyways, speculating around
>> performance without numbers on hands sounds cheap.
>
> This is done whenever userspace invokes sched_setaffinity, or changes
> its cgroup cpuset. It may not be the most important fast-path in the
> world, but I expect some workloads to issue sched_setaffinity whenever
> they create a thread, so it's not a purely slow-path either.
>
>> In my experience, iterators with a very lightweight payload are ~100
>> times slower comparing to dedicated bitmap ops. Check this for example:
>> 3cea8d4753277.
>>
>> If you're really cared about performance here, I'd suggest you to
>> compare your iterators approach with something like this:
>>
>> cpumask_or(mm_allowed, mm_allowed, cpumask);
>> atomic_set(&mm->nr_cpus_allowed, cpumask_weight(mm_allowed));
Here are the benchmark results. Each test uses two entirely filled
bitmaps as input, to mimic the common scenario where the CPUs allowed
mask is updated with a subset of the original process CPUs allowed,
and the common case where the initial cpumask is filled.
#define BITMAP_LEN (4096UL * 8 * 10)
(len = BITMAP_LEN)
* Approach 1:

	int nr_set = 0;

	for_each_andnot_bit(bit, bitmap, bitmap2, len)
		nr_set += !test_and_set_bit(bit, bitmap2);
	if (nr_set)
		atomic_add(nr_set, &total);

Time: 4680 ns

* Approach 2:

	int nr_set = 0;

	for_each_set_bit(bit, bitmap, len)
		nr_set += !test_and_set_bit(bit, bitmap2);
	if (nr_set)
		atomic_add(nr_set, &total);

Time: 1791537 ns

* Approach 3:

	mutex_lock(&lock);
	bitmap_or(bitmap2, bitmap, bitmap2, len);
	atomic_set(&total, bitmap_weight(bitmap2, len));
	mutex_unlock(&lock);

Time: 79591 ns
So approach 1 is 382 times faster than approach 2, and 17 times
faster than approach 3. And this is only single-threaded,
I expect approaches 2&3 to perform even worse when contended
due to the many fully ordered test_and_set_bit (2) and due to
locking (3).
The test hardware is an AMD EPYC 9654 96-Core Processor.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
On 2024-09-04 14:28, Mathieu Desnoyers wrote:
[...]
> * Approach 3:
>
> 	mutex_lock(&lock);
> 	bitmap_or(bitmap2, bitmap, bitmap2, len);
> 	atomic_set(&total, bitmap_weight(bitmap2, len));
> 	mutex_unlock(&lock);
>
> Time: 79591 ns

The benchmark result is wrong for approach 3, as it was taken with
CONFIG_PROVE_LOCKING=y and lockdep. Corrected result:

Time: 4500 ns

So let's go with your approach. I'm wondering whether I should re-use
an existing mutex/spinlock from mm_struct or add a new one.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
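
For illustration, one possible shape of the locked variant discussed
above, assuming a new raw spinlock (named cpus_allowed_lock here, a
hypothetical addition to struct mm_struct) serializes updates:

	static inline void mm_set_cpus_allowed(struct mm_struct *mm,
					       const struct cpumask *cpumask)
	{
		struct cpumask *mm_allowed;

		if (!mm)
			return;
		mm_allowed = mm_cpus_allowed(mm);
		raw_spin_lock(&mm->cpus_allowed_lock);
		/* Augment the union of all threads' allowed CPUs. */
		cpumask_or(mm_allowed, mm_allowed, cpumask);
		atomic_set(&mm->nr_cpus_allowed, cpumask_weight(mm_allowed));
		raw_spin_unlock(&mm->cpus_allowed_lock);
	}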