From: Mathieu Desnoyers
To: Peter Zijlstra, Ingo Molnar
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, "Paul E. McKenney",
    Boqun Feng, Valentin Schneider, Mel Gorman, Steven Rostedt,
    Vincent Guittot, Dietmar Eggemann, Ben Segall, Yury Norov,
    Rasmus Villemoes, Marco Elver, Dmitry Vyukov
Subject: [PATCH v1 1/2] sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads
Date: Thu, 3 Oct 2024 20:44:38 -0400
Message-Id: <20241004004439.1673801-2-mathieu.desnoyers@efficios.com>
In-Reply-To: <20241004004439.1673801-1-mathieu.desnoyers@efficios.com>
References: <20241004004439.1673801-1-mathieu.desnoyers@efficios.com>

commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
introduced a per-mm/cpu current concurrency id (mm_cid), which keeps
a reference to the concurrency id allocated for each CPU. This reference
expires shortly after a 100ms delay.

These per-CPU references keep the per-mm-cid data cache-local in
situations where threads are running at least once on each CPU within
each 100ms window, thus keeping the per-cpu reference alive.

However, intermittent workloads behaving in bursts spaced by more than
100ms on each CPU exhibit bad cache locality and degraded performance
compared to purely per-cpu data indexing, because concurrency IDs are
allocated over various CPUs and cores, therefore losing cache locality
of the associated data.

Introduce the following changes to improve per-mm-cid cache locality:

- Add a "recent_cid" field to the per-mm/cpu mm_cid structure to keep
  track of which mm_cid value was last used, and use it as a hint to
  attempt re-allocating the same concurrency ID the next time this
  mm/cpu needs to allocate a concurrency ID,

- Add a per-mm CPUs allowed mask, which keeps track of the union of
  CPUs allowed for all threads belonging to this mm. This cpumask is
  only set during the lifetime of the mm, never cleared, so it
  represents the union of all the CPUs allowed since the beginning of
  the mm lifetime (note that the mm_cpumask() is really arch-specific
  and tailored to the TLB flush needs, and is thus _not_ a viable
  approach for this),

- Add a per-mm nr_cpus_allowed to keep track of the weight of the
  per-mm CPUs allowed mask (for fast access),

- Add a per-mm nr_cids_used to keep track of the highest concurrency
  ID allocated for the mm. This is used for expanding the concurrency
  ID allocation within the upper bound defined by:

    min(mm->nr_cpus_allowed, mm->mm_users)

  When the next unused CID value reaches this threshold, stop trying
  to expand the cid allocation and use the first available cid value
  instead. Spreading allocation to use all the cid values within the
  range

    [ 0, min(mm->nr_cpus_allowed, mm->mm_users) - 1 ]

  improves cache locality while preserving mm_cid compactness within
  the expected user limits,

- In __mm_cid_try_get, only return cid values within the range
  [ 0, mm->nr_cpus_allowed ] rather than [ 0, nr_cpu_ids ]. This
  prevents allocating cids above the number of allowed cpus in rare
  scenarios where cid allocation races with a concurrent remote-clear
  of the per-mm/cpu cid. This improvement is made possible by the
  addition of the per-mm CPUs allowed mask,

- In sched_mm_cid_migrate_to, use mm->nr_cpus_allowed rather than
  t->nr_cpus_allowed. This criterion was really meant to compare the
  number of mm->mm_users to the number of CPUs allowed for the entire
  mm. Therefore, the prior comparison worked fine when all threads
  shared the same CPUs allowed mask, but not so much in scenarios
  where those threads have different masks (e.g. each thread pinned
  to a single CPU). This improvement is made possible by the addition
  of the per-mm CPUs allowed mask.
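For illustration, the resulting allocation policy can be modelled in
plain user-space C as follows. This is a simplified sketch with
hypothetical names (cid_try_get(), cid_used[]), without the atomics,
lazy-put and remote-clear handling of the real kernel code; the
authoritative implementation is __mm_cid_try_get() in the diff below.

/* Simplified model of the per-mm concurrency ID allocation policy. */
#include <stdio.h>

#define NR_CIDS 16

static int cid_used[NR_CIDS];	/* models mm_cidmask() */
static int nr_cids_used;	/* models mm->nr_cids_used */

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/*
 * Allocate a concurrency ID: prefer the per-mm/cpu recent_cid hint,
 * then expand up to min(nr_cpus_allowed, mm_users), then fall back to
 * the first free cid below nr_cpus_allowed.
 */
static int cid_try_get(int recent_cid, int nr_cpus_allowed, int mm_users)
{
	int max_cids = min_int(nr_cpus_allowed, mm_users);
	int cid;

	/* 1) Re-use the recently used cid if it is still free. */
	if (recent_cid >= 0 && !cid_used[recent_cid]) {
		cid_used[recent_cid] = 1;
		return recent_cid;
	}
	/* 2) Expand the allocation while below the threshold. */
	while (nr_cids_used < max_cids) {
		cid = nr_cids_used++;
		if (!cid_used[cid]) {
			cid_used[cid] = 1;
			return cid;
		}
	}
	/* 3) Use the first available cid below the number of allowed CPUs. */
	for (cid = 0; cid < nr_cpus_allowed; cid++) {
		if (!cid_used[cid]) {
			cid_used[cid] = 1;
			return cid;
		}
	}
	return -1;
}

int main(void)
{
	/* 4 allowed CPUs, 2 threads: the compact range is [0, 1]. */
	printf("%d\n", cid_try_get(-1, 4, 2));	/* 0: expand */
	printf("%d\n", cid_try_get(-1, 4, 2));	/* 1: expand */
	printf("%d\n", cid_try_get(0, 4, 2));	/* 2: hint busy, range full, first free */
	return 0;
}

The recent_cid hint is what makes an intermittent thread land back on
the same cid (and therefore the same cache-hot data) after an idle
period longer than 100ms.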
* Benchmarks

Each thread increments 16kB worth of 8-bit integers in bursts, with
a configurable delay between each thread's execution. Each thread runs
one after the other (no threads run concurrently). The order of thread
execution in the sequence is random. The thread execution sequence
begins again after all threads have executed. The 16kB areas are
allocated with rseq_mempool and indexed by either cpu_id, mm_cid
(not cache-local), or cache-local mm_cid. Each thread is pinned to its
own core.

Testing configurations:

8-core/1-L3:       Use 8 cores within a single L3
24-core/24-L3:     Use 24 cores, 1 core per L3
192-core/24-L3:    Use 192 cores (all cores in the system)
384-thread/24-L3:  Use 384 HW threads (all HW threads in the system)

Intermittent workload delays between threads: 200ms, 10ms.

Hardware:

CPU(s):                  384
On-line CPU(s) list:     0-383
Vendor ID:               AuthenticAMD
Model name:              AMD EPYC 9654 96-Core Processor
Thread(s) per core:      2
Core(s) per socket:      96
Socket(s):               2
Caches (sum of all):
  L1d:                   6 MiB (192 instances)
  L1i:                   6 MiB (192 instances)
  L2:                    192 MiB (192 instances)
  L3:                    768 MiB (24 instances)

Each result is an average of 5 test runs. The cache-local speedup is
calculated as the ratio: (mm_cid) / (cache-local mm_cid).

Intermittent workload delay: 200ms

                     per-cpu     mm_cid   cache-local mm_cid   cache-local speedup
                        (ns)       (ns)                 (ns)
  8-core/1-L3           1374      19289                 1336                 14.4x
  24-core/24-L3         2423      26721                 1594                 16.7x
  192-core/24-L3        2291      15826                 2153                  7.3x
  384-thread/24-L3      1874      13234                 1907                  6.9x

Intermittent workload delay: 10ms

                     per-cpu     mm_cid   cache-local mm_cid   cache-local speedup
                        (ns)       (ns)                 (ns)
  8-core/1-L3            662        756                  686                  1.1x
  24-core/24-L3         1378       3648                 1035                  3.5x
  192-core/24-L3        1439      10833                 1482                  7.3x
  384-thread/24-L3      1503      10570                 1556                  6.8x
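The benchmark program itself is not part of this series. The following
is a rough sketch of what a single burst amounts to, assuming the
selftests' rseq.h from patch 2/2 (for rseq_register_current_thread()
and rseq_current_mm_cid()) and illustrative buffer sizes; the
rseq_mempool allocation, per-core pinning, thread sequencing and
timing are omitted.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#include "rseq.h"	/* selftests/rseq: rseq_register_current_thread(), rseq_current_mm_cid() */

#define NR_SLOTS	384		/* >= number of possible concurrency IDs on the test machine */
#define SLOT_LEN	(16 * 1024)	/* 16kB worth of 8-bit integers per slot */

static uint8_t buf[NR_SLOTS][SLOT_LEN];

/* One burst: increment the 16kB area selected by the current mm_cid. */
static void do_burst(void)
{
	uint8_t *area = buf[rseq_current_mm_cid()];
	size_t i;

	for (i = 0; i < SLOT_LEN; i++)
		area[i]++;
}

int main(void)
{
	if (rseq_register_current_thread())
		return 1;
	do_burst();
	printf("buf[0][0] = %u\n", (unsigned)buf[0][0]);
	return 0;
}

Indexing by cpu_id instead simply replaces rseq_current_mm_cid() with
rseq_current_cpu().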
[ This deprecates the prior "sched: NUMA-aware per-memory-map concurrency IDs"
  patch series with a simpler and more general approach. ]

Link: https://lore.kernel.org/lkml/20240823185946.418340-1-mathieu.desnoyers@efficios.com/
Signed-off-by: Mathieu Desnoyers
Acked-by: Marco Elver
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Valentin Schneider
Cc: Mel Gorman
Cc: Steven Rostedt
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Ben Segall
Cc: Dmitry Vyukov
Cc: Marco Elver
Cc: Yury Norov
Cc: Rasmus Villemoes
---
 fs/exec.c                |  2 +-
 include/linux/mm_types.h | 72 +++++++++++++++++++++++++++++++++++-----
 kernel/fork.c            |  2 +-
 kernel/sched/core.c      | 22 +++++++-----
 kernel/sched/sched.h     | 47 ++++++++++++++++++--------
 5 files changed, 111 insertions(+), 34 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 6c53920795c2..aaa605529a75 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -990,7 +990,7 @@ static int exec_mmap(struct mm_struct *mm)
 	active_mm = tsk->active_mm;
 	tsk->active_mm = mm;
 	tsk->mm = mm;
-	mm_init_cid(mm);
+	mm_init_cid(mm, tsk);
 	/*
 	 * This prevents preemption while active_mm is being loaded and
 	 * it and mm are being updated, which could cause problems for
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e3bdf8e38bc..8b5a185b4d5a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -782,6 +782,7 @@ struct vm_area_struct {
 struct mm_cid {
 	u64 time;
 	int cid;
+	int recent_cid;
 };
 #endif
 
@@ -852,6 +853,27 @@ struct mm_struct {
 		 * When the next mm_cid scan is due (in jiffies).
 		 */
 		unsigned long mm_cid_next_scan;
+		/**
+		 * @nr_cpus_allowed: Number of CPUs allowed for mm.
+		 *
+		 * Number of CPUs allowed in the union of all mm's
+		 * threads allowed CPUs.
+		 */
+		atomic_t nr_cpus_allowed;
+		/**
+		 * @nr_cids_used: Number of used concurrency IDs.
+		 *
+		 * Track the highest concurrency ID allocated for the
+		 * mm: nr_cids_used - 1.
+		 */
+		atomic_t nr_cids_used;
+		/**
+		 * @cpus_allowed_lock: Lock protecting mm cpus_allowed.
+		 *
+		 * Provide mutual exclusion for mm cpus_allowed and
+		 * mm nr_cpus_allowed updates.
+		 */
+		spinlock_t cpus_allowed_lock;
 #endif
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* size of all page tables */
@@ -1170,18 +1192,30 @@ static inline int mm_cid_clear_lazy_put(int cid)
 	return cid & ~MM_CID_LAZY_PUT;
 }
 
+/*
+ * mm_cpus_allowed: Union of all mm's threads allowed CPUs.
+ */
+static inline cpumask_t *mm_cpus_allowed(struct mm_struct *mm)
+{
+	unsigned long bitmap = (unsigned long)mm;
+
+	bitmap += offsetof(struct mm_struct, cpu_bitmap);
+	/* Skip cpu_bitmap */
+	bitmap += cpumask_size();
+	return (struct cpumask *)bitmap;
+}
+
 /* Accessor for struct mm_struct's cidmask. */
 static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
 {
-	unsigned long cid_bitmap = (unsigned long)mm;
+	unsigned long cid_bitmap = (unsigned long)mm_cpus_allowed(mm);
 
-	cid_bitmap += offsetof(struct mm_struct, cpu_bitmap);
-	/* Skip cpu_bitmap */
+	/* Skip mm_cpus_allowed */
 	cid_bitmap += cpumask_size();
 	return (struct cpumask *)cid_bitmap;
 }
 
-static inline void mm_init_cid(struct mm_struct *mm)
+static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
 {
 	int i;
 
@@ -1189,17 +1223,22 @@ static inline void mm_init_cid(struct mm_struct *mm)
 		struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);
 
 		pcpu_cid->cid = MM_CID_UNSET;
+		pcpu_cid->recent_cid = MM_CID_UNSET;
 		pcpu_cid->time = 0;
 	}
+	atomic_set(&mm->nr_cpus_allowed, p->nr_cpus_allowed);
+	atomic_set(&mm->nr_cids_used, 0);
+	spin_lock_init(&mm->cpus_allowed_lock);
+	cpumask_copy(mm_cpus_allowed(mm), p->cpus_ptr);
 	cpumask_clear(mm_cidmask(mm));
 }
 
-static inline int mm_alloc_cid_noprof(struct mm_struct *mm)
+static inline int mm_alloc_cid_noprof(struct mm_struct *mm, struct task_struct *p)
 {
 	mm->pcpu_cid = alloc_percpu_noprof(struct mm_cid);
 	if (!mm->pcpu_cid)
 		return -ENOMEM;
-	mm_init_cid(mm);
+	mm_init_cid(mm, p);
 	return 0;
 }
 #define mm_alloc_cid(...)	alloc_hooks(mm_alloc_cid_noprof(__VA_ARGS__))
@@ -1212,16 +1251,31 @@ static inline void mm_destroy_cid(struct mm_struct *mm)
 
 static inline unsigned int mm_cid_size(void)
 {
-	return cpumask_size();
+	return 2 * cpumask_size();	/* mm_cpus_allowed(), mm_cidmask(). */
+}
+
+static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask)
+{
+	struct cpumask *mm_allowed = mm_cpus_allowed(mm);
+
+	if (!mm)
+		return;
+	/* The mm_cpus_allowed is the union of each thread allowed CPUs masks. */
+	spin_lock(&mm->cpus_allowed_lock);
+	cpumask_or(mm_allowed, mm_allowed, cpumask);
+	atomic_set(&mm->nr_cpus_allowed, cpumask_weight(mm_allowed));
+	spin_unlock(&mm->cpus_allowed_lock);
 }
 #else /* CONFIG_SCHED_MM_CID */
-static inline void mm_init_cid(struct mm_struct *mm) { }
-static inline int mm_alloc_cid(struct mm_struct *mm) { return 0; }
+static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p) { }
+static inline int mm_alloc_cid(struct mm_struct *mm, struct task_struct *p) { return 0; }
 static inline void mm_destroy_cid(struct mm_struct *mm) { }
+
 static inline unsigned int mm_cid_size(void)
 {
 	return 0;
 }
+static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask) { }
 #endif /* CONFIG_SCHED_MM_CID */
 
 struct mmu_gather;
diff --git a/kernel/fork.c b/kernel/fork.c
index 60c0b4868fd4..18bf37ae73a5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1298,7 +1298,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (init_new_context(p, mm))
 		goto fail_nocontext;
 
-	if (mm_alloc_cid(mm))
+	if (mm_alloc_cid(mm, p))
 		goto fail_cid;
 
 	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 43e453ab7e20..772a3daf784a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2691,6 +2691,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
 		put_prev_task(rq, p);
 
 	p->sched_class->set_cpus_allowed(p, ctx);
+	mm_set_cpus_allowed(p->mm, ctx->new_mask);
 
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
@@ -10228,6 +10229,7 @@ int __sched_mm_cid_migrate_from_try_steal_cid(struct rq *src_rq,
 	 */
 	if (!try_cmpxchg(&src_pcpu_cid->cid, &lazy_cid, MM_CID_UNSET))
 		return -1;
+	WRITE_ONCE(src_pcpu_cid->recent_cid, MM_CID_UNSET);
 	return src_cid;
 }
 
@@ -10240,7 +10242,8 @@ void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
 {
 	struct mm_cid *src_pcpu_cid, *dst_pcpu_cid;
 	struct mm_struct *mm = t->mm;
-	int src_cid, dst_cid, src_cpu;
+	int src_cid, src_cpu;
+	bool dst_cid_is_set;
 	struct rq *src_rq;
 
 	lockdep_assert_rq_held(dst_rq);
@@ -10257,9 +10260,9 @@ void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
 	 * allocation closest to 0 in cases where few threads migrate around
 	 * many CPUs.
 	 *
-	 * If destination cid is already set, we may have to just clear
-	 * the src cid to ensure compactness in frequent migrations
-	 * scenarios.
+	 * If destination cid or recent cid is already set, we may have
+	 * to just clear the src cid to ensure compactness in frequent
+	 * migrations scenarios.
 	 *
 	 * It is not useful to clear the src cid when the number of threads is
 	 * greater or equal to the number of allowed CPUs, because user-space
@@ -10267,9 +10270,9 @@ void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
 	 * allowed CPUs.
 	 */
 	dst_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(dst_rq));
-	dst_cid = READ_ONCE(dst_pcpu_cid->cid);
-	if (!mm_cid_is_unset(dst_cid) &&
-	    atomic_read(&mm->mm_users) >= t->nr_cpus_allowed)
+	dst_cid_is_set = !mm_cid_is_unset(READ_ONCE(dst_pcpu_cid->cid)) ||
+			 !mm_cid_is_unset(READ_ONCE(dst_pcpu_cid->recent_cid));
+	if (dst_cid_is_set && atomic_read(&mm->mm_users) >= atomic_read(&mm->nr_cpus_allowed))
 		return;
 	src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, src_cpu);
 	src_rq = cpu_rq(src_cpu);
@@ -10280,13 +10283,14 @@ void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
 						      src_cid);
 	if (src_cid == -1)
 		return;
-	if (!mm_cid_is_unset(dst_cid)) {
+	if (dst_cid_is_set) {
 		__mm_cid_put(mm, src_cid);
 		return;
 	}
 	/* Move src_cid to dst cpu. */
 	mm_cid_snapshot_time(dst_rq, mm);
 	WRITE_ONCE(dst_pcpu_cid->cid, src_cid);
+	WRITE_ONCE(dst_pcpu_cid->recent_cid, src_cid);
 }
 
 static void sched_mm_cid_remote_clear(struct mm_struct *mm, struct mm_cid *pcpu_cid,
@@ -10523,7 +10527,7 @@ void sched_mm_cid_after_execve(struct task_struct *t)
 		 * Matches barrier in sched_mm_cid_remote_clear_old().
 		 */
 		smp_mb();
-		t->last_mm_cid = t->mm_cid = mm_cid_get(rq, mm);
+		t->last_mm_cid = t->mm_cid = mm_cid_get(rq, t, mm);
 	}
 	rseq_set_notify_resume(t);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b1c3588a8f00..f32b8571af5d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3596,24 +3596,40 @@ static inline void mm_cid_put(struct mm_struct *mm)
 	__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
 }
 
-static inline int __mm_cid_try_get(struct mm_struct *mm)
+static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
 {
-	struct cpumask *cpumask;
-	int cid;
+	struct cpumask *cidmask = mm_cidmask(mm);
+	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
+	int cid = __this_cpu_read(pcpu_cid->recent_cid);
 
-	cpumask = mm_cidmask(mm);
+	/* Try to re-use recent cid. This improves cache locality. */
+	if (!mm_cid_is_unset(cid) && !cpumask_test_and_set_cpu(cid, cidmask))
+		return cid;
+	/*
+	 * Expand cid allocation if used cids are below the number cpus
+	 * allowed and number of threads. Expanding cid allocation as
+	 * much as possible improves cache locality.
+	 */
+	cid = atomic_read(&mm->nr_cids_used);
+	while (cid < atomic_read(&mm->nr_cpus_allowed) && cid < atomic_read(&mm->mm_users)) {
+		if (!atomic_try_cmpxchg(&mm->nr_cids_used, &cid, cid + 1))
+			continue;
+		if (!cpumask_test_and_set_cpu(cid, cidmask))
+			return cid;
+	}
 	/*
+	 * Find the first available concurrency id.
 	 * Retry finding first zero bit if the mask is temporarily
 	 * filled. This only happens during concurrent remote-clear
 	 * which owns a cid without holding a rq lock.
 	 */
 	for (;;) {
-		cid = cpumask_first_zero(cpumask);
-		if (cid < nr_cpu_ids)
+		cid = cpumask_first_zero(cidmask);
+		if (cid < atomic_read(&mm->nr_cpus_allowed))
 			break;
 		cpu_relax();
 	}
-	if (cpumask_test_and_set_cpu(cid, cpumask))
+	if (cpumask_test_and_set_cpu(cid, cidmask))
 		return -1;
 
 	return cid;
@@ -3631,7 +3647,8 @@ static inline void mm_cid_snapshot_time(struct rq *rq, struct mm_struct *mm)
 	WRITE_ONCE(pcpu_cid->time, rq->clock);
 }
 
-static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
+static inline int __mm_cid_get(struct rq *rq, struct task_struct *t,
+			       struct mm_struct *mm)
 {
 	int cid;
 
@@ -3641,13 +3658,13 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
 	 * guarantee forward progress.
 	 */
 	if (!READ_ONCE(use_cid_lock)) {
-		cid = __mm_cid_try_get(mm);
+		cid = __mm_cid_try_get(t, mm);
 		if (cid >= 0)
 			goto end;
 		raw_spin_lock(&cid_lock);
 	} else {
 		raw_spin_lock(&cid_lock);
-		cid = __mm_cid_try_get(mm);
+		cid = __mm_cid_try_get(t, mm);
 		if (cid >= 0)
 			goto unlock;
 	}
@@ -3667,7 +3684,7 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
 	 * all newcoming allocations observe the use_cid_lock flag set.
 	 */
 	do {
-		cid = __mm_cid_try_get(mm);
+		cid = __mm_cid_try_get(t, mm);
 		cpu_relax();
 	} while (cid < 0);
 	/*
@@ -3684,7 +3701,8 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
 	return cid;
 }
 
-static inline int mm_cid_get(struct rq *rq, struct mm_struct *mm)
+static inline int mm_cid_get(struct rq *rq, struct task_struct *t,
+			     struct mm_struct *mm)
 {
 	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
 	struct cpumask *cpumask;
@@ -3701,8 +3719,9 @@ static inline int mm_cid_get(struct rq *rq, struct mm_struct *mm)
 		if (try_cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, &cid, MM_CID_UNSET))
 			__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
 	}
-	cid = __mm_cid_get(rq, mm);
+	cid = __mm_cid_get(rq, t, mm);
 	__this_cpu_write(pcpu_cid->cid, cid);
+	__this_cpu_write(pcpu_cid->recent_cid, cid);
 
 	return cid;
 }
@@ -3755,7 +3774,7 @@ static inline void switch_mm_cid(struct rq *rq,
 		prev->mm_cid = -1;
 	}
 	if (next->mm_cid_active)
-		next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next->mm);
+		next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next, next->mm);
 }
 
 #else /* !CONFIG_SCHED_MM_CID: */
-- 
2.39.2

From: Mathieu Desnoyers
To: Peter Zijlstra, Ingo Molnar
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, "Paul E. McKenney",
    Boqun Feng, Valentin Schneider, Mel Gorman, Steven Rostedt,
    Vincent Guittot, Dietmar Eggemann, Ben Segall, Yury Norov,
    Rasmus Villemoes, Shuah Khan, Carlos O'Donell, Florian Weimer
Subject: [PATCH v1 2/2] selftests/rseq: Fix mm_cid test failure
Date: Thu, 3 Oct 2024 20:44:39 -0400
Message-Id: <20241004004439.1673801-3-mathieu.desnoyers@efficios.com>
In-Reply-To: <20241004004439.1673801-1-mathieu.desnoyers@efficios.com>
References: <20241004004439.1673801-1-mathieu.desnoyers@efficios.com>

Adapt the rseq.c/rseq.h code to follow GNU C library changes introduced by:

  commit 2e456ccf0c34 ("Linux: Make __rseq_size useful for feature
  detection (bug 31965)")

Without this fix, rseq selftests for mm_cid fail:

  ./run_param_test.sh
  Default parameters
  Running test spinlock
  Running compare-twice test spinlock
  Running mm_cid test spinlock
  Error: cpu id getter unavailable

Signed-off-by: Mathieu Desnoyers
Cc: Peter Zijlstra
Cc: Boqun Feng
Cc: "Paul E. McKenney"
Cc: Shuah Khan
Cc: Carlos O'Donell
Cc: Florian Weimer
Acked-by: Shuah Khan
---
 tools/testing/selftests/rseq/rseq.c | 109 +++++++++++++++++++---------
 tools/testing/selftests/rseq/rseq.h |  10 +--
 2 files changed, 76 insertions(+), 43 deletions(-)
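For context on the glibc change being adapted to: once __rseq_size
reflects the active feature size (glibc commit 2e456ccf0c34), an
application can gate mm_cid usage on it directly. The following is a
minimal sketch under that assumption, with a hypothetical helper
have_mm_cid(); it needs kernel UAPI headers where struct rseq has the
mm_cid field (Linux >= 6.3) and a glibc >= 2.35 that exports
__rseq_size. Older glibc versions expose the 32-byte allocation size
instead, which is why rseq_init() in the diff below special-cases the
values 20 and 32.

#include <stdio.h>
#include <stddef.h>
#include <linux/rseq.h>		/* kernel UAPI struct rseq, including mm_cid */

/* Provided by glibc >= 2.35; the feature size after commit 2e456ccf0c34. */
extern const unsigned int __rseq_size;

/* mm_cid is usable iff the registered feature area covers the mm_cid field. */
static int have_mm_cid(void)
{
	return __rseq_size >= offsetof(struct rseq, mm_cid) +
			      sizeof(((struct rseq *)0)->mm_cid);
}

int main(void)
{
	printf("__rseq_size = %u, mm_cid %savailable\n",
	       __rseq_size, have_mm_cid() ? "" : "not ");
	return 0;
}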
diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
index 96e812bdf8a4..3797bb0881da 100644
--- a/tools/testing/selftests/rseq/rseq.c
+++ b/tools/testing/selftests/rseq/rseq.c
@@ -60,12 +60,6 @@ unsigned int rseq_size = -1U;
 /* Flags used during rseq registration. */
 unsigned int rseq_flags;
 
-/*
- * rseq feature size supported by the kernel. 0 if the registration was
- * unsuccessful.
- */
-unsigned int rseq_feature_size = -1U;
-
 static int rseq_ownership;
 static int rseq_reg_success;	/* At least one rseq registration has succeded. */
 
@@ -111,6 +105,43 @@ int rseq_available(void)
 	}
 }
 
+/* The rseq areas need to be at least 32 bytes. */
+static
+unsigned get_rseq_min_alloc_size(void)
+{
+	unsigned int alloc_size = rseq_size;
+
+	if (alloc_size < ORIG_RSEQ_ALLOC_SIZE)
+		alloc_size = ORIG_RSEQ_ALLOC_SIZE;
+	return alloc_size;
+}
+
+/*
+ * Return the feature size supported by the kernel.
+ *
+ * Depending on the value returned by getauxval(AT_RSEQ_FEATURE_SIZE):
+ *
+ * 0:   Return ORIG_RSEQ_FEATURE_SIZE (20)
+ * > 0: Return the value from getauxval(AT_RSEQ_FEATURE_SIZE).
+ *
+ * It should never return a value below ORIG_RSEQ_FEATURE_SIZE.
+ */
+static
+unsigned int get_rseq_kernel_feature_size(void)
+{
+	unsigned long auxv_rseq_feature_size, auxv_rseq_align;
+
+	auxv_rseq_align = getauxval(AT_RSEQ_ALIGN);
+	assert(!auxv_rseq_align || auxv_rseq_align <= RSEQ_THREAD_AREA_ALLOC_SIZE);
+
+	auxv_rseq_feature_size = getauxval(AT_RSEQ_FEATURE_SIZE);
+	assert(!auxv_rseq_feature_size || auxv_rseq_feature_size <= RSEQ_THREAD_AREA_ALLOC_SIZE);
+	if (auxv_rseq_feature_size)
+		return auxv_rseq_feature_size;
+	else
+		return ORIG_RSEQ_FEATURE_SIZE;
+}
+
 int rseq_register_current_thread(void)
 {
 	int rc;
@@ -119,7 +150,7 @@ int rseq_register_current_thread(void)
 		/* Treat libc's ownership as a successful registration. */
 		return 0;
 	}
-	rc = sys_rseq(&__rseq_abi, rseq_size, 0, RSEQ_SIG);
+	rc = sys_rseq(&__rseq_abi, get_rseq_min_alloc_size(), 0, RSEQ_SIG);
 	if (rc) {
 		if (RSEQ_READ_ONCE(rseq_reg_success)) {
 			/* Incoherent success/failure within process. */
@@ -140,28 +171,12 @@ int rseq_unregister_current_thread(void)
 		/* Treat libc's ownership as a successful unregistration. */
 		return 0;
 	}
-	rc = sys_rseq(&__rseq_abi, rseq_size, RSEQ_ABI_FLAG_UNREGISTER, RSEQ_SIG);
+	rc = sys_rseq(&__rseq_abi, get_rseq_min_alloc_size(), RSEQ_ABI_FLAG_UNREGISTER, RSEQ_SIG);
 	if (rc)
 		return -1;
 	return 0;
 }
 
-static
-unsigned int get_rseq_feature_size(void)
-{
-	unsigned long auxv_rseq_feature_size, auxv_rseq_align;
-
-	auxv_rseq_align = getauxval(AT_RSEQ_ALIGN);
-	assert(!auxv_rseq_align || auxv_rseq_align <= RSEQ_THREAD_AREA_ALLOC_SIZE);
-
-	auxv_rseq_feature_size = getauxval(AT_RSEQ_FEATURE_SIZE);
-	assert(!auxv_rseq_feature_size || auxv_rseq_feature_size <= RSEQ_THREAD_AREA_ALLOC_SIZE);
-	if (auxv_rseq_feature_size)
-		return auxv_rseq_feature_size;
-	else
-		return ORIG_RSEQ_FEATURE_SIZE;
-}
-
 static __attribute__((constructor))
 void rseq_init(void)
 {
@@ -178,28 +193,53 @@ void rseq_init(void)
 	}
 	if (libc_rseq_size_p && libc_rseq_offset_p && libc_rseq_flags_p &&
 			*libc_rseq_size_p != 0) {
+		unsigned int libc_rseq_size;
+
 		/* rseq registration owned by glibc */
 		rseq_offset = *libc_rseq_offset_p;
-		rseq_size = *libc_rseq_size_p;
+		libc_rseq_size = *libc_rseq_size_p;
 		rseq_flags = *libc_rseq_flags_p;
-		rseq_feature_size = get_rseq_feature_size();
-		if (rseq_feature_size > rseq_size)
-			rseq_feature_size = rseq_size;
+
+		/*
+		 * Previous versions of glibc expose the value
+		 * 32 even though the kernel only supported 20
+		 * bytes initially. Therefore treat 32 as a
+		 * special-case. glibc 2.40 exposes a 20 bytes
+		 * __rseq_size without using getauxval(3) to
+		 * query the supported size, while still allocating a 32
+		 * bytes area. Also treat 20 as a special-case.
+		 *
+		 * Special-cases are handled by using the following
+		 * value as active feature set size:
+		 *
+		 *   rseq_size = min(32, get_rseq_kernel_feature_size())
+		 */
+		switch (libc_rseq_size) {
+		case ORIG_RSEQ_FEATURE_SIZE:	/* Fallthrough. */
+		case ORIG_RSEQ_ALLOC_SIZE:
+		{
+			unsigned int rseq_kernel_feature_size = get_rseq_kernel_feature_size();
+
+			if (rseq_kernel_feature_size < ORIG_RSEQ_ALLOC_SIZE)
+				rseq_size = rseq_kernel_feature_size;
+			else
+				rseq_size = ORIG_RSEQ_ALLOC_SIZE;
+			break;
+		}
+		default:
+			/* Otherwise just use the __rseq_size from libc as rseq_size. */
+			rseq_size = libc_rseq_size;
+			break;
+		}
 		return;
 	}
 	rseq_ownership = 1;
 	if (!rseq_available()) {
 		rseq_size = 0;
-		rseq_feature_size = 0;
 		return;
 	}
 	rseq_offset = (void *)&__rseq_abi - rseq_thread_pointer();
 	rseq_flags = 0;
-	rseq_feature_size = get_rseq_feature_size();
-	if (rseq_feature_size == ORIG_RSEQ_FEATURE_SIZE)
-		rseq_size = ORIG_RSEQ_ALLOC_SIZE;
-	else
-		rseq_size = RSEQ_THREAD_AREA_ALLOC_SIZE;
 }
 
 static __attribute__((destructor))
@@ -209,7 +249,6 @@ void rseq_exit(void)
 		return;
 	rseq_offset = 0;
 	rseq_size = -1U;
-	rseq_feature_size = -1U;
 	rseq_ownership = 0;
 }
 
diff --git a/tools/testing/selftests/rseq/rseq.h b/tools/testing/selftests/rseq/rseq.h
index d7364ea4d201..4e217b620e0c 100644
--- a/tools/testing/selftests/rseq/rseq.h
+++ b/tools/testing/selftests/rseq/rseq.h
@@ -68,12 +68,6 @@ extern unsigned int rseq_size;
 /* Flags used during rseq registration. */
 extern unsigned int rseq_flags;
 
-/*
- * rseq feature size supported by the kernel. 0 if the registration was
- * unsuccessful.
- */
-extern unsigned int rseq_feature_size;
-
 enum rseq_mo {
 	RSEQ_MO_RELAXED = 0,
 	RSEQ_MO_CONSUME = 1,	/* Unused */
@@ -193,7 +187,7 @@ static inline uint32_t rseq_current_cpu(void)
 
 static inline bool rseq_node_id_available(void)
 {
-	return (int) rseq_feature_size >= rseq_offsetofend(struct rseq_abi, node_id);
+	return (int) rseq_size >= rseq_offsetofend(struct rseq_abi, node_id);
 }
 
 /*
@@ -207,7 +201,7 @@ static inline uint32_t rseq_current_node_id(void)
 
 static inline bool rseq_mm_cid_available(void)
 {
-	return (int) rseq_feature_size >= rseq_offsetofend(struct rseq_abi, mm_cid);
+	return (int) rseq_size >= rseq_offsetofend(struct rseq_abi, mm_cid);
 }
 
 static inline uint32_t rseq_current_mm_cid(void)
-- 
2.39.2
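Not part of the patch, but as a usage sketch: with this fix applied, a
selftest-style program relies on rseq_size alone for feature detection,
for example:

#include <stdio.h>

#include "rseq.h"	/* tools/testing/selftests/rseq/rseq.h after this patch */

int main(void)
{
	if (rseq_register_current_thread()) {
		fprintf(stderr, "rseq registration failed\n");
		return 1;
	}
	if (!rseq_mm_cid_available()) {
		fprintf(stderr, "mm_cid not available\n");
		return 1;
	}
	printf("cpu %u, mm_cid %u\n",
	       (unsigned)rseq_current_cpu(), (unsigned)rseq_current_mm_cid());
	return rseq_unregister_current_thread() ? 1 : 0;
}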