From nobody Fri Feb 13 09:49:55 2026
From: Mathieu Desnoyers
To: Ingo Molnar, Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, Steven Rostedt,
 Vincent Guittot, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
 Daniel Bristot de Oliveira, Valentin Schneider, "levi . yun",
 Catalin Marinas, Mark Rutland, Will Deacon, Aaron Lu, Thomas Gleixner,
 Borislav Petkov, Dave Hansen, "H. Peter Anvin", Arnd Bergmann,
 Andrew Morton, linux-arch@vger.kernel.org, linux-mm@kvack.org,
 x86@kernel.org, stable@vger.kernel.org
Subject: [PATCH 1/2] sched: Add missing memory barrier in switch_mm_cid
Date: Mon, 15 Apr 2024 11:21:13 -0400
Message-Id: <20240415152114.59122-2-mathieu.desnoyers@efficios.com>
In-Reply-To: <20240415152114.59122-1-mathieu.desnoyers@efficios.com>
References: <20240415152114.59122-1-mathieu.desnoyers@efficios.com>

Many architectures' switch_mm() (e.g. arm64) do not have an smp_mb()
which the core scheduler code has depended upon since commit:

    commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")

If switch_mm() doesn't call smp_mb(), sched_mm_cid_remote_clear() can
unset the actively used cid when it fails to observe active task after it
sets lazy_put.
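
The race described above has the shape of a store-buffering handshake:
each side stores its own flag, then loads the other side's flag, and a
full barrier is needed between the store and the load on both sides so
that at least one of them observes the other. The following is a minimal
user-space sketch of that shape only, using C11 atomics and pthreads. It
is not kernel code and not part of this patch; curr_is_user,
cid_lazy_put and count_missed_handshakes() are hypothetical names chosen
for illustration, and the plain store on the remote-clear side stands in
for what the kernel does with a cmpxchg.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for illustration; these are not kernel symbols. */
static atomic_int curr_is_user;	/* models the store to rq->curr */
static atomic_int cid_lazy_put;	/* models the lazy-put flag on the per-cpu cid */
static int sched_saw_lazy_put;	/* what the context-switch side loaded */
static int remote_saw_task;	/* what the remote-clear side loaded */

/* Context-switch side: publish the new task, then look at the cid. */
static void *sched_side(void *arg)
{
	(void)arg;
	atomic_store_explicit(&curr_is_user, 1, memory_order_relaxed);
	/* Stands in for the full barrier required before switch_mm_cid(). */
	atomic_thread_fence(memory_order_seq_cst);
	sched_saw_lazy_put = atomic_load_explicit(&cid_lazy_put,
						  memory_order_relaxed);
	return NULL;
}

/* Remote-clear side: flag the cid as lazy-put, then look for a running task. */
static void *remote_clear_side(void *arg)
{
	(void)arg;
	atomic_store_explicit(&cid_lazy_put, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	remote_saw_task = atomic_load_explicit(&curr_is_user,
					       memory_order_relaxed);
	return NULL;
}

/*
 * Run the handshake repeatedly and count the trials in which both sides
 * missed each other's store. Returns -1 if a thread could not be created.
 */
int count_missed_handshakes(int trials)
{
	int missed = 0;

	for (int i = 0; i < trials; i++) {
		pthread_t a, b;

		atomic_store(&curr_is_user, 0);
		atomic_store(&cid_lazy_put, 0);
		if (pthread_create(&a, NULL, sched_side, NULL))
			return -1;
		if (pthread_create(&b, NULL, remote_clear_side, NULL)) {
			pthread_join(a, NULL);
			return -1;
		}
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		/* Both loads returning 0 is the outcome the barriers forbid. */
		if (!sched_saw_lazy_put && !remote_saw_task)
			missed++;
	}
	return missed;
}
```

With both fences in place, the C11 memory model forbids the outcome where
both loads return 0, so count_missed_handshakes() is expected to return 0.
Dropping the fence in sched_side() makes that outcome permitted, which
corresponds to the remote side unsetting a cid that is still in use,
although any particular run may not exhibit it.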

There *is* a memory barrier between storing to rq->curr and _return to
userspace_ (as required by membarrier), but the rseq mm_cid has stricter
requirements: the barrier needs to be issued between store to rq->curr
and switch_mm_cid(), which happens earlier than:

  - spin_unlock(),
  - switch_to().

So it's fine when the architecture switch_mm() happens to have that
barrier already, but less so when the architecture only provides the
full barrier in switch_to() or spin_unlock().

It is a bug in the rseq switch_mm_cid() implementation. All architectures
that don't have memory barriers in switch_mm(), but rather have the full
barrier either in finish_lock_switch() or switch_to() have them too late
for the needs of switch_mm_cid().

Introduce a new smp_mb__after_switch_mm(), defined as smp_mb() in the
generic barrier.h header, and use it in switch_mm_cid() for scheduler
transitions where switch_mm() is expected to provide a memory barrier.

Architectures can override smp_mb__after_switch_mm() if their
switch_mm() implementation provides an implicit memory barrier.
Override it with a no-op on x86, which implicitly provides this memory
barrier by writing to CR3.

Link: https://lore.kernel.org/lkml/20240305145335.2696125-1-yeoreum.yun@arm.com/
Reported-by: levi.yun
Signed-off-by: Mathieu Desnoyers
Reviewed-by: Catalin Marinas # for arm64
Acked-by: Dave Hansen # for x86
Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
Cc: # 6.4.x
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Vincent Guittot
Cc: Juri Lelli
Cc: Dietmar Eggemann
Cc: Ben Segall
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Valentin Schneider
Cc: levi.yun
Cc: Mathieu Desnoyers
Cc: Catalin Marinas
Cc: Mark Rutland
Cc: Will Deacon
Cc: Aaron Lu
Cc: Thomas Gleixner
Cc: Borislav Petkov
Cc: Dave Hansen
Cc: "H. Peter Anvin"
Cc: Arnd Bergmann
Cc: Andrew Morton
Cc: linux-arch@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: x86@kernel.org
---
 arch/x86/include/asm/barrier.h |  3 +++
 include/asm-generic/barrier.h  |  8 ++++++++
 kernel/sched/sched.h           | 20 ++++++++++++++------
 3 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index fe1e7e3cc844..63bdc6b85219 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -79,6 +79,9 @@ do { \
 #define __smp_mb__before_atomic()	do { } while (0)
 #define __smp_mb__after_atomic()	do { } while (0)
 
+/* Writing to CR3 provides a full memory barrier in switch_mm(). */
+#define smp_mb__after_switch_mm()	do { } while (0)
+
 #include
 
 #endif /* _ASM_X86_BARRIER_H */
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 0c0695763bea..dc32b96140c1 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -294,5 +294,13 @@ do { \
 #define io_stop_wc() do { } while (0)
 #endif
 
+/*
+ * Architectures that guarantee an implicit smp_mb() in switch_mm()
+ * can override smp_mb__after_switch_mm.
+ */
+#ifndef smp_mb__after_switch_mm
+#define smp_mb__after_switch_mm()	smp_mb()
+#endif
+
 #endif /* !__ASSEMBLY__ */
 #endif /* __ASM_GENERIC_BARRIER_H */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d2242679239e..d2895d264196 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -79,6 +79,8 @@
 # include
 #endif
 
+#include
+
 #include "cpupri.h"
 #include "cpudeadline.h"
 
@@ -3445,13 +3447,19 @@ static inline void switch_mm_cid(struct rq *rq,
 		 * between rq->curr store and load of {prev,next}->mm->pcpu_cid[cpu].
 		 * Provide it here.
 		 */
-		if (!prev->mm)	// from kernel
+		if (!prev->mm) {	// from kernel
 			smp_mb();
-		/*
-		 * user -> user transition guarantees a memory barrier through
-		 * switch_mm() when current->mm changes. If current->mm is
-		 * unchanged, no barrier is needed.
-		 */
+		} else {	// from user
+			/*
+			 * user -> user transition relies on an implicit
+			 * memory barrier in switch_mm() when
+			 * current->mm changes. If the architecture
+			 * switch_mm() does not have an implicit memory
+			 * barrier, it is emitted here. If current->mm
+			 * is unchanged, no barrier is needed.
+			 */
+			smp_mb__after_switch_mm();
+		}
 	}
 	if (prev->mm_cid_active) {
 		mm_cid_snapshot_time(rq, prev->mm);
-- 
2.39.2

From nobody Fri Feb 13 09:49:55 2026
From: Mathieu Desnoyers
To: Ingo Molnar, Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, Steven Rostedt,
 Vincent Guittot, Juri Lelli, Dietmar Eggemann, Ben Segall, Mel Gorman,
 Daniel Bristot de Oliveira, Valentin Schneider, "levi . yun",
 Catalin Marinas, Mark Rutland, Will Deacon, Aaron Lu, Thomas Gleixner,
 Borislav Petkov, Dave Hansen, "H. Peter Anvin", Arnd Bergmann,
 Andrew Morton, linux-arch@vger.kernel.org, linux-mm@kvack.org,
 x86@kernel.org
Subject: [PATCH 2/2] sched: Move mm_cid code from sched.h to core.c
Date: Mon, 15 Apr 2024 11:21:14 -0400
Message-Id: <20240415152114.59122-3-mathieu.desnoyers@efficios.com>
In-Reply-To: <20240415152114.59122-1-mathieu.desnoyers@efficios.com>
References: <20240415152114.59122-1-mathieu.desnoyers@efficios.com>

The mm_cid code in sched/sched.h is only used from sched/core.c. Move it
to the compile unit where it belongs.

While reviewing mm_cid functions which were already in sched/core.c, I
noticed that a few of them are non-static even though they are only used
from core.c. Make those functions static inline.

For the sake of keeping things consistent, mm_cid functions only marked
"static" are now marked "static inline".

The variables cid_lock and use_cid_lock are only used from core.c, so
mark them as static.

Moving from non-static to static inline for:

- sched_mm_cid_migrate_from
- init_sched_mm_cid
- task_tick_mm_cid

And the forced inlining of:

- __sched_mm_cid_migrate_from_fetch_cid
- __sched_mm_cid_migrate_from_try_steal_cid
- sched_mm_cid_migrate_to
- sched_mm_cid_remote_clear
- sched_mm_cid_remote_clear_old
- sched_mm_cid_remote_clear_weight

slightly improves the size of sched/core.o on x86-64 (in bytes):

           text    data
before:  192261   58677
after:   191629   58641
-----------------------------
delta:     -632     -36

Signed-off-by: Mathieu Desnoyers
Cc: Ingo Molnar
Cc: Peter Zijlstra
---
 kernel/sched/core.c  | 277 +++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h | 241 -------------------------------------
 2 files changed, 257 insertions(+), 261 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7019a40457a6..57b03d874530 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -457,6 +457,22 @@ sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags) { }
 
 #endif /* CONFIG_SCHED_CORE */
 
+#ifdef CONFIG_SCHED_MM_CID
+static inline void switch_mm_cid(struct rq *rq, struct task_struct *prev,
+				 struct task_struct *next);
+static inline void sched_mm_cid_migrate_from(struct task_struct *t);
+static inline void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t);
+static inline void task_tick_mm_cid(struct rq *rq, struct task_struct *curr);
+static inline void init_sched_mm_cid(struct task_struct *t);
+#else
+static inline void switch_mm_cid(struct rq *rq, struct task_struct *prev,
+				 struct task_struct *next) { }
+static inline void sched_mm_cid_migrate_from(struct task_struct *t) { }
+static inline void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t) { }
+static inline void task_tick_mm_cid(struct rq *rq, struct task_struct *curr) { }
+static inline void init_sched_mm_cid(struct task_struct *t) { }
+#endif
+
 /*
  * Serialization rules:
  *
@@ -11551,6 +11567,9 @@ void call_trace_sched_update_nr_running(struct rq *rq, int count)
 
 #ifdef CONFIG_SCHED_MM_CID
 
+#define SCHED_MM_CID_PERIOD_NS	(100ULL * 1000000)	/* 100ms */
+#define MM_CID_SCAN_DELAY	100			/* 100ms */
+
 /*
  * @cid_lock: Guarantee forward-progress of cid allocation.
  *
@@ -11558,7 +11577,7 @@ void call_trace_sched_update_nr_running(struct rq *rq, int count)
  * is only used when contention is detected by the lock-free allocation so
  * forward progress can be guaranteed.
  */
-DEFINE_RAW_SPINLOCK(cid_lock);
+static DEFINE_RAW_SPINLOCK(cid_lock);
 
 /*
  * @use_cid_lock: Select cid allocation behavior: lock-free vs spinlock.
@@ -11569,7 +11588,7 @@ DEFINE_RAW_SPINLOCK(cid_lock);
  * completes and sets @use_cid_lock back to 0. This guarantees forward progress
  * of a cid allocation.
  */
-int use_cid_lock;
+static int use_cid_lock;
 
 /*
  * mm_cid remote-clear implements a lock-free algorithm to clear per-mm/cpu cid
@@ -11659,15 +11678,233 @@ int use_cid_lock;
  * because this would UNSET a cid which is actively used.
  */
 
-void sched_mm_cid_migrate_from(struct task_struct *t)
+static inline void __mm_cid_put(struct mm_struct *mm, int cid)
+{
+	if (cid < 0)
+		return;
+	cpumask_clear_cpu(cid, mm_cidmask(mm));
+}
+
+/*
+ * The per-mm/cpu cid can have the MM_CID_LAZY_PUT flag set or transition to
+ * the MM_CID_UNSET state without holding the rq lock, but the rq lock needs to
+ * be held to transition to other states.
+ *
+ * State transitions synchronized with cmpxchg or try_cmpxchg need to be
+ * consistent across cpus, which prevents use of this_cpu_cmpxchg.
+ */
+static inline void mm_cid_put_lazy(struct task_struct *t)
+{
+	struct mm_struct *mm = t->mm;
+	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
+	int cid;
+
+	lockdep_assert_irqs_disabled();
+	cid = __this_cpu_read(pcpu_cid->cid);
+	if (!mm_cid_is_lazy_put(cid) ||
+	    !try_cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, &cid, MM_CID_UNSET))
+		return;
+	__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
+}
+
+static inline int mm_cid_pcpu_unset(struct mm_struct *mm)
+{
+	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
+	int cid, res;
+
+	lockdep_assert_irqs_disabled();
+	cid = __this_cpu_read(pcpu_cid->cid);
+	for (;;) {
+		if (mm_cid_is_unset(cid))
+			return MM_CID_UNSET;
+		/*
+		 * Attempt transition from valid or lazy-put to unset.
+		 */
+		res = cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, cid, MM_CID_UNSET);
+		if (res == cid)
+			break;
+		cid = res;
+	}
+	return cid;
+}
+
+static inline void mm_cid_put(struct mm_struct *mm)
+{
+	int cid;
+
+	lockdep_assert_irqs_disabled();
+	cid = mm_cid_pcpu_unset(mm);
+	if (cid == MM_CID_UNSET)
+		return;
+	__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
+}
+
+static inline int __mm_cid_try_get(struct mm_struct *mm)
+{
+	struct cpumask *cpumask;
+	int cid;
+
+	cpumask = mm_cidmask(mm);
+	/*
+	 * Retry finding first zero bit if the mask is temporarily
+	 * filled. This only happens during concurrent remote-clear
+	 * which owns a cid without holding a rq lock.
+	 */
+	for (;;) {
+		cid = cpumask_first_zero(cpumask);
+		if (cid < nr_cpu_ids)
+			break;
+		cpu_relax();
+	}
+	if (cpumask_test_and_set_cpu(cid, cpumask))
+		return -1;
+	return cid;
+}
+
+/*
+ * Save a snapshot of the current runqueue time of this cpu
+ * with the per-cpu cid value, allowing to estimate how recently it was used.
+ */
+static inline void mm_cid_snapshot_time(struct rq *rq, struct mm_struct *mm)
+{
+	struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(rq));
+
+	lockdep_assert_rq_held(rq);
+	WRITE_ONCE(pcpu_cid->time, rq->clock);
+}
+
+static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
+{
+	int cid;
+
+	/*
+	 * All allocations (even those using the cid_lock) are lock-free. If
+	 * use_cid_lock is set, hold the cid_lock to perform cid allocation to
+	 * guarantee forward progress.
+	 */
+	if (!READ_ONCE(use_cid_lock)) {
+		cid = __mm_cid_try_get(mm);
+		if (cid >= 0)
+			goto end;
+		raw_spin_lock(&cid_lock);
+	} else {
+		raw_spin_lock(&cid_lock);
+		cid = __mm_cid_try_get(mm);
+		if (cid >= 0)
+			goto unlock;
+	}
+
+	/*
+	 * cid concurrently allocated. Retry while forcing following
+	 * allocations to use the cid_lock to ensure forward progress.
+	 */
+	WRITE_ONCE(use_cid_lock, 1);
+	/*
+	 * Set use_cid_lock before allocation. Only care about program order
+	 * because this is only required for forward progress.
+	 */
+	barrier();
+	/*
+	 * Retry until it succeeds. It is guaranteed to eventually succeed once
+	 * all newcoming allocations observe the use_cid_lock flag set.
+	 */
+	do {
+		cid = __mm_cid_try_get(mm);
+		cpu_relax();
+	} while (cid < 0);
+	/*
+	 * Allocate before clearing use_cid_lock. Only care about
+	 * program order because this is for forward progress.
+	 */
+	barrier();
+	WRITE_ONCE(use_cid_lock, 0);
+unlock:
+	raw_spin_unlock(&cid_lock);
+end:
+	mm_cid_snapshot_time(rq, mm);
+	return cid;
+}
+
+static inline int mm_cid_get(struct rq *rq, struct mm_struct *mm)
+{
+	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
+	struct cpumask *cpumask;
+	int cid;
+
+	lockdep_assert_rq_held(rq);
+	cpumask = mm_cidmask(mm);
+	cid = __this_cpu_read(pcpu_cid->cid);
+	if (mm_cid_is_valid(cid)) {
+		mm_cid_snapshot_time(rq, mm);
+		return cid;
+	}
+	if (mm_cid_is_lazy_put(cid)) {
+		if (try_cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, &cid, MM_CID_UNSET))
+			__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
+	}
+	cid = __mm_cid_get(rq, mm);
+	__this_cpu_write(pcpu_cid->cid, cid);
+	return cid;
+}
+
+static inline void switch_mm_cid(struct rq *rq, struct task_struct *prev,
+				 struct task_struct *next)
+{
+	/*
+	 * Provide a memory barrier between rq->curr store and load of
+	 * {prev,next}->mm->pcpu_cid[cpu] on rq->curr->mm transition.
+	 *
+	 * Should be adapted if context_switch() is modified.
+	 */
+	if (!next->mm) {	// to kernel
+		/*
+		 * user -> kernel transition does not guarantee a barrier, but
+		 * we can use the fact that it performs an atomic operation in
+		 * mmgrab().
+		 */
+		if (prev->mm)	// from user
+			smp_mb__after_mmgrab();
+		/*
+		 * kernel -> kernel transition does not change rq->curr->mm
+		 * state. It stays NULL.
+		 */
+	} else {	// to user
+		/*
+		 * kernel -> user transition does not provide a barrier
+		 * between rq->curr store and load of {prev,next}->mm->pcpu_cid[cpu].
+		 * Provide it here.
+		 */
+		if (!prev->mm) {	// from kernel
+			smp_mb();
+		} else {	// from user
+			/*
+			 * user -> user transition relies on an implicit
+			 * memory barrier in switch_mm() when
+			 * current->mm changes. If the architecture
+			 * switch_mm() does not have an implicit memory
+			 * barrier, it is emitted here. If current->mm
+			 * is unchanged, no barrier is needed.
+			 */
+			smp_mb__after_switch_mm();
+		}
+	}
+	if (prev->mm_cid_active) {
+		mm_cid_snapshot_time(rq, prev->mm);
+		mm_cid_put_lazy(prev);
+		prev->mm_cid = -1;
+	}
+	if (next->mm_cid_active)
+		next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next->mm);
+}
+
+static inline void sched_mm_cid_migrate_from(struct task_struct *t)
 {
 	t->migrate_from_cpu = task_cpu(t);
 }
 
-static
-int __sched_mm_cid_migrate_from_fetch_cid(struct rq *src_rq,
-					  struct task_struct *t,
-					  struct mm_cid *src_pcpu_cid)
+static inline int __sched_mm_cid_migrate_from_fetch_cid(struct rq *src_rq,
+							 struct task_struct *t,
+							 struct mm_cid *src_pcpu_cid)
 {
 	struct mm_struct *mm = t->mm;
 	struct task_struct *src_task;
@@ -11703,11 +11940,10 @@ int __sched_mm_cid_migrate_from_fetch_cid(struct rq *src_rq,
 	return src_cid;
 }
 
-static
-int __sched_mm_cid_migrate_from_try_steal_cid(struct rq *src_rq,
-					      struct task_struct *t,
-					      struct mm_cid *src_pcpu_cid,
-					      int src_cid)
+static inline int __sched_mm_cid_migrate_from_try_steal_cid(struct rq *src_rq,
+							     struct task_struct *t,
+							     struct mm_cid *src_pcpu_cid,
+							     int src_cid)
 {
 	struct task_struct *src_task;
 	struct mm_struct *mm = t->mm;
@@ -11767,7 +12003,7 @@ int __sched_mm_cid_migrate_from_try_steal_cid(struct rq *src_rq,
  * Interrupts are disabled, which keeps the window of cid ownership without the
  * source rq lock held small.
  */
-void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
+static inline void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
 {
 	struct mm_cid *src_pcpu_cid, *dst_pcpu_cid;
 	struct mm_struct *mm = t->mm;
@@ -11820,8 +12056,9 @@ void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
 	WRITE_ONCE(dst_pcpu_cid->cid, src_cid);
 }
 
-static void sched_mm_cid_remote_clear(struct mm_struct *mm, struct mm_cid *pcpu_cid,
-				      int cpu)
+static inline void sched_mm_cid_remote_clear(struct mm_struct *mm,
+					     struct mm_cid *pcpu_cid,
+					     int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	struct task_struct *t;
@@ -11876,7 +12113,7 @@ static void sched_mm_cid_remote_clear(struct mm_struct *mm, struct mm_cid *pcpu_
 	}
 }
 
-static void sched_mm_cid_remote_clear_old(struct mm_struct *mm, int cpu)
+static inline void sched_mm_cid_remote_clear_old(struct mm_struct *mm, int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	struct mm_cid *pcpu_cid;
@@ -11908,8 +12145,8 @@ static void sched_mm_cid_remote_clear_old(struct mm_struct *mm, int cpu)
 	sched_mm_cid_remote_clear(mm, pcpu_cid, cpu);
 }
 
-static void sched_mm_cid_remote_clear_weight(struct mm_struct *mm, int cpu,
-					     int weight)
+static inline void sched_mm_cid_remote_clear_weight(struct mm_struct *mm, int cpu,
+						    int weight)
 {
 	struct mm_cid *pcpu_cid;
 	int cid;
@@ -11965,7 +12202,7 @@ static void task_mm_cid_work(struct callback_head *work)
 		sched_mm_cid_remote_clear_weight(mm, cpu, weight);
 }
 
-void init_sched_mm_cid(struct task_struct *t)
+static inline void init_sched_mm_cid(struct task_struct *t)
 {
 	struct mm_struct *mm = t->mm;
 	int mm_users = 0;
@@ -11979,7 +12216,7 @@ void init_sched_mm_cid(struct task_struct *t)
 	init_task_work(&t->cid_work, task_mm_cid_work);
 }
 
-void task_tick_mm_cid(struct rq *rq, struct task_struct *curr)
+static inline void task_tick_mm_cid(struct rq *rq, struct task_struct *curr)
 {
 	struct callback_head *work = &curr->cid_work;
 	unsigned long now = jiffies;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d2895d264196..1b8e3e23ef40 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3237,247 +3237,6 @@ extern int sched_dynamic_mode(const char *str);
 extern void sched_dynamic_update(int mode);
 #endif
 
-#ifdef CONFIG_SCHED_MM_CID
-
-#define SCHED_MM_CID_PERIOD_NS	(100ULL * 1000000)	/* 100ms */
-#define MM_CID_SCAN_DELAY	100			/* 100ms */
-
-extern raw_spinlock_t cid_lock;
-extern int use_cid_lock;
-
-extern void sched_mm_cid_migrate_from(struct task_struct *t);
-extern void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t);
-extern void task_tick_mm_cid(struct rq *rq, struct task_struct *curr);
-extern void init_sched_mm_cid(struct task_struct *t);
-
-static inline void __mm_cid_put(struct mm_struct *mm, int cid)
-{
-	if (cid < 0)
-		return;
-	cpumask_clear_cpu(cid, mm_cidmask(mm));
-}
-
-/*
- * The per-mm/cpu cid can have the MM_CID_LAZY_PUT flag set or transition to
- * the MM_CID_UNSET state without holding the rq lock, but the rq lock needs to
- * be held to transition to other states.
- *
- * State transitions synchronized with cmpxchg or try_cmpxchg need to be
- * consistent across cpus, which prevents use of this_cpu_cmpxchg.
- */
-static inline void mm_cid_put_lazy(struct task_struct *t)
-{
-	struct mm_struct *mm = t->mm;
-	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
-	int cid;
-
-	lockdep_assert_irqs_disabled();
-	cid = __this_cpu_read(pcpu_cid->cid);
-	if (!mm_cid_is_lazy_put(cid) ||
-	    !try_cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, &cid, MM_CID_UNSET))
-		return;
-	__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
-}
-
-static inline int mm_cid_pcpu_unset(struct mm_struct *mm)
-{
-	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
-	int cid, res;
-
-	lockdep_assert_irqs_disabled();
-	cid = __this_cpu_read(pcpu_cid->cid);
-	for (;;) {
-		if (mm_cid_is_unset(cid))
-			return MM_CID_UNSET;
-		/*
-		 * Attempt transition from valid or lazy-put to unset.
-		 */
-		res = cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, cid, MM_CID_UNSET);
-		if (res == cid)
-			break;
-		cid = res;
-	}
-	return cid;
-}
-
-static inline void mm_cid_put(struct mm_struct *mm)
-{
-	int cid;
-
-	lockdep_assert_irqs_disabled();
-	cid = mm_cid_pcpu_unset(mm);
-	if (cid == MM_CID_UNSET)
-		return;
-	__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
-}
-
-static inline int __mm_cid_try_get(struct mm_struct *mm)
-{
-	struct cpumask *cpumask;
-	int cid;
-
-	cpumask = mm_cidmask(mm);
-	/*
-	 * Retry finding first zero bit if the mask is temporarily
-	 * filled. This only happens during concurrent remote-clear
-	 * which owns a cid without holding a rq lock.
-	 */
-	for (;;) {
-		cid = cpumask_first_zero(cpumask);
-		if (cid < nr_cpu_ids)
-			break;
-		cpu_relax();
-	}
-	if (cpumask_test_and_set_cpu(cid, cpumask))
-		return -1;
-	return cid;
-}
-
-/*
- * Save a snapshot of the current runqueue time of this cpu
- * with the per-cpu cid value, allowing to estimate how recently it was used.
- */
-static inline void mm_cid_snapshot_time(struct rq *rq, struct mm_struct *mm)
-{
-	struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(rq));
-
-	lockdep_assert_rq_held(rq);
-	WRITE_ONCE(pcpu_cid->time, rq->clock);
-}
-
-static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
-{
-	int cid;
-
-	/*
-	 * All allocations (even those using the cid_lock) are lock-free. If
-	 * use_cid_lock is set, hold the cid_lock to perform cid allocation to
-	 * guarantee forward progress.
-	 */
-	if (!READ_ONCE(use_cid_lock)) {
-		cid = __mm_cid_try_get(mm);
-		if (cid >= 0)
-			goto end;
-		raw_spin_lock(&cid_lock);
-	} else {
-		raw_spin_lock(&cid_lock);
-		cid = __mm_cid_try_get(mm);
-		if (cid >= 0)
-			goto unlock;
-	}
-
-	/*
-	 * cid concurrently allocated. Retry while forcing following
-	 * allocations to use the cid_lock to ensure forward progress.
-	 */
-	WRITE_ONCE(use_cid_lock, 1);
-	/*
-	 * Set use_cid_lock before allocation. Only care about program order
-	 * because this is only required for forward progress.
-	 */
-	barrier();
-	/*
-	 * Retry until it succeeds. It is guaranteed to eventually succeed once
-	 * all newcoming allocations observe the use_cid_lock flag set.
-	 */
-	do {
-		cid = __mm_cid_try_get(mm);
-		cpu_relax();
-	} while (cid < 0);
-	/*
-	 * Allocate before clearing use_cid_lock. Only care about
-	 * program order because this is for forward progress.
-	 */
-	barrier();
-	WRITE_ONCE(use_cid_lock, 0);
-unlock:
-	raw_spin_unlock(&cid_lock);
-end:
-	mm_cid_snapshot_time(rq, mm);
-	return cid;
-}
-
-static inline int mm_cid_get(struct rq *rq, struct mm_struct *mm)
-{
-	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
-	struct cpumask *cpumask;
-	int cid;
-
-	lockdep_assert_rq_held(rq);
-	cpumask = mm_cidmask(mm);
-	cid = __this_cpu_read(pcpu_cid->cid);
-	if (mm_cid_is_valid(cid)) {
-		mm_cid_snapshot_time(rq, mm);
-		return cid;
-	}
-	if (mm_cid_is_lazy_put(cid)) {
-		if (try_cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, &cid, MM_CID_UNSET))
-			__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
-	}
-	cid = __mm_cid_get(rq, mm);
-	__this_cpu_write(pcpu_cid->cid, cid);
-	return cid;
-}
-
-static inline void switch_mm_cid(struct rq *rq,
-				 struct task_struct *prev,
-				 struct task_struct *next)
-{
-	/*
-	 * Provide a memory barrier between rq->curr store and load of
-	 * {prev,next}->mm->pcpu_cid[cpu] on rq->curr->mm transition.
-	 *
-	 * Should be adapted if context_switch() is modified.
-	 */
-	if (!next->mm) {	// to kernel
-		/*
-		 * user -> kernel transition does not guarantee a barrier, but
-		 * we can use the fact that it performs an atomic operation in
-		 * mmgrab().
-		 */
-		if (prev->mm)	// from user
-			smp_mb__after_mmgrab();
-		/*
-		 * kernel -> kernel transition does not change rq->curr->mm
-		 * state. It stays NULL.
-		 */
-	} else {	// to user
-		/*
-		 * kernel -> user transition does not provide a barrier
-		 * between rq->curr store and load of {prev,next}->mm->pcpu_cid[cpu].
-		 * Provide it here.
-		 */
-		if (!prev->mm) {	// from kernel
-			smp_mb();
-		} else {	// from user
-			/*
-			 * user -> user transition relies on an implicit
-			 * memory barrier in switch_mm() when
-			 * current->mm changes. If the architecture
-			 * switch_mm() does not have an implicit memory
-			 * barrier, it is emitted here. If current->mm
-			 * is unchanged, no barrier is needed.
-			 */
-			smp_mb__after_switch_mm();
-		}
-	}
-	if (prev->mm_cid_active) {
-		mm_cid_snapshot_time(rq, prev->mm);
-		mm_cid_put_lazy(prev);
-		prev->mm_cid = -1;
-	}
-	if (next->mm_cid_active)
-		next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next->mm);
-}
-
-#else
-static inline void switch_mm_cid(struct rq *rq, struct task_struct *prev, struct task_struct *next) { }
-static inline void sched_mm_cid_migrate_from(struct task_struct *t) { }
-static inline void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t) { }
-static inline void task_tick_mm_cid(struct rq *rq, struct task_struct *curr) { }
-static inline void init_sched_mm_cid(struct task_struct *t) { }
-#endif
-
 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
 extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
 
-- 
2.39.2