[PATCH v8 05/12] arm64: rqspinlock: Remove private copy of smp_cond_load_acquire_timewait()

In preparation for defining smp_cond_load_acquire_timeout(), remove
the private copy. Without it, the rqspinlock code falls back to using
smp_cond_load_acquire().

Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: bpf@vger.kernel.org
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Haris Okanovic <harisokn@amazon.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/include/asm/rqspinlock.h | 85 -----------------------------
 1 file changed, 85 deletions(-)

diff --git a/arch/arm64/include/asm/rqspinlock.h b/arch/arm64/include/asm/rqspinlock.h
index 9ea0a74e5892..a385603436e9 100644
--- a/arch/arm64/include/asm/rqspinlock.h
+++ b/arch/arm64/include/asm/rqspinlock.h
@@ -3,91 +3,6 @@
 #define _ASM_RQSPINLOCK_H
 
 #include <asm/barrier.h>
-
-/*
- * Hardcode res_smp_cond_load_acquire implementations for arm64 to a custom
- * version based on [0]. In rqspinlock code, our conditional expression involves
- * checking the value _and_ additionally a timeout. However, on arm64, the
- * WFE-based implementation may never spin again if no stores occur to the
- * locked byte in the lock word. As such, we may be stuck forever if
- * event-stream based unblocking is not available on the platform for WFE spin
- * loops (arch_timer_evtstrm_available).
- *
- * Once support for smp_cond_load_acquire_timewait [0] lands, we can drop this
- * copy-paste.
- *
- * While we rely on the implementation to amortize the cost of sampling
- * cond_expr for us, it will not happen when event stream support is
- * unavailable, time_expr check is amortized. This is not the common case, and
- * it would be difficult to fit our logic in the time_expr_ns >= time_limit_ns
- * comparison, hence just let it be. In case of event-stream, the loop is woken
- * up at microsecond granularity.
- *
- * [0]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com
- */
-
-#ifndef smp_cond_load_acquire_timewait
-
-#define smp_cond_time_check_count	200
-
-#define __smp_cond_load_relaxed_spinwait(ptr, cond_expr, time_expr_ns,	\
-					 time_limit_ns) ({		\
-	typeof(ptr) __PTR = (ptr);					\
-	__unqual_scalar_typeof(*ptr) VAL;				\
-	unsigned int __count = 0;					\
-	for (;;) {							\
-		VAL = READ_ONCE(*__PTR);				\
-		if (cond_expr)						\
-			break;						\
-		cpu_relax();						\
-		if (__count++ < smp_cond_time_check_count)		\
-			continue;					\
-		if ((time_expr_ns) >= (time_limit_ns))			\
-			break;						\
-		__count = 0;						\
-	}								\
-	(typeof(*ptr))VAL;						\
-})
-
-#define __smp_cond_load_acquire_timewait(ptr, cond_expr,		\
-					 time_expr_ns, time_limit_ns)	\
-({									\
-	typeof(ptr) __PTR = (ptr);					\
-	__unqual_scalar_typeof(*ptr) VAL;				\
-	for (;;) {							\
-		VAL = smp_load_acquire(__PTR);				\
-		if (cond_expr)						\
-			break;						\
-		__cmpwait_relaxed(__PTR, VAL);				\
-		if ((time_expr_ns) >= (time_limit_ns))			\
-			break;						\
-	}								\
-	(typeof(*ptr))VAL;						\
-})
-
-#define smp_cond_load_acquire_timewait(ptr, cond_expr,			\
-				      time_expr_ns, time_limit_ns)	\
-({									\
-	__unqual_scalar_typeof(*ptr) _val;				\
-	int __wfe = arch_timer_evtstrm_available();			\
-									\
-	if (likely(__wfe)) {						\
-		_val = __smp_cond_load_acquire_timewait(ptr, cond_expr,	\
-							time_expr_ns,	\
-							time_limit_ns);	\
-	} else {							\
-		_val = __smp_cond_load_relaxed_spinwait(ptr, cond_expr,	\
-							time_expr_ns,	\
-							time_limit_ns);	\
-		smp_acquire__after_ctrl_dep();				\
-	}								\
-	(typeof(*ptr))_val;						\
-})
-
-#endif
-
-#define res_smp_cond_load_acquire(v, c) smp_cond_load_acquire_timewait(v, c, 0, 1)
-
 #include <asm-generic/rqspinlock.h>
 
 #endif /* _ASM_RQSPINLOCK_H */
-- 
2.31.1
Re: [PATCH v8 05/12] arm64: rqspinlock: Remove private copy of smp_cond_load_acquire_timewait()
Posted by bot+bpf-ci@kernel.org
> diff --git a/arch/arm64/include/asm/rqspinlock.h b/arch/arm64/include/asm/rqspinlock.h
> index 9ea0a74e5892..a385603436e9 100644
> --- a/arch/arm64/include/asm/rqspinlock.h
> +++ b/arch/arm64/include/asm/rqspinlock.h
> @@ -3,91 +3,6 @@
>  #define _ASM_RQSPINLOCK_H
>
>  #include <asm/barrier.h>
> -
> -/*
> - * Hardcode res_smp_cond_load_acquire implementations for arm64 to a custom
> - * version based on [0]. In rqspinlock code, our conditional expression involves
> - * checking the value _and_ additionally a timeout. However, on arm64, the
> - * WFE-based implementation may never spin again if no stores occur to the
> - * locked byte in the lock word. As such, we may be stuck forever if
> - * event-stream based unblocking is not available on the platform for WFE spin
> - * loops (arch_timer_evtstrm_available).
                                               ^^^^

Can lock acquisition hang indefinitely on platforms without event-stream
support? The deleted implementation provided special handling for the
arch_timer_evtstrm_available() == false case using
__smp_cond_load_relaxed_spinwait(), which spun with cpu_relax() and
checked the timeout condition every 200 iterations.

After this removal, resilient_queued_spin_lock_slowpath() at line 408
calls:

    res_smp_cond_load_acquire(&lock->locked,
                              !VAL || RES_CHECK_TIMEOUT(ts, ret, _Q_LOCKED_MASK));
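
With the arm64 override gone, res_smp_cond_load_acquire() presumably
resolves to the plain pass-through fallback (paraphrased sketch of the
define in kernel/bpf/rqspinlock.c):

    #ifndef res_smp_cond_load_acquire
    #define res_smp_cond_load_acquire(v, c) smp_cond_load_acquire(v, c)
    #endif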

This now falls back to the generic smp_cond_load_acquire() (via the
fallback mapping in kernel/bpf/rqspinlock.c:307), which waits in
__cmpwait_relaxed(). On arm64, __cmpwait_relaxed() arms the exclusive
monitor on the lock word and executes WFE; without event-stream
support, that WFE may never wake up if no stores occur to the lock
word.
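
For context, the arm64 wait loop is essentially the following
(simplified sketch of the smp_cond_load_acquire() definition in
arch/arm64/include/asm/barrier.h; note that nothing in it checks a
timeout):

    for (;;) {
            VAL = smp_load_acquire(__PTR);
            if (cond_expr)                  /* value check only */
                    break;
            __cmpwait_relaxed(__PTR, VAL);  /* LDXR + WFE: sleeps until
                                             * a store hits *__PTR or
                                             * the event stream fires */
    }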

Even if WFE does wake up periodically, the RES_CHECK_TIMEOUT macro is
defined with amortization (checking the timeout only when !(ts.spin++)
wraps, i.e. every ~256 iterations), which may be sparser than the
deleted implementation's explicit check every 200 iterations.
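
The amortization in question is roughly the following (paraphrased
from kernel/bpf/rqspinlock.c; the exact wrap period depends on the
width of ts.spin):

    #define RES_CHECK_TIMEOUT(ts, ret, mask)                        \
    ({                                                              \
            if (!(ts).spin++)  /* sample the clock only on wrap */  \
                    (ret) = check_timeout((lock), (mask), &(ts));   \
            (ret);                                                  \
    })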

This means the timeout and deadlock detection logic may not function
correctly, potentially causing lock acquisition to hang instead of
returning -ETIMEDOUT or -EDEADLK after the intended 250ms timeout
(RES_DEF_TIMEOUT).

Note: The next commit in the series (f150b57ec7c8 "bpf/rqspinlock: Use
smp_cond_load_acquire_timeout()") switches to proper timeout-aware
primitives, which resolves this issue.

> - *
> - * Once support for smp_cond_load_acquire_timewait [0] lands, we can drop this
> - * copy-paste.

[ ... ]

> -#define res_smp_cond_load_acquire(v, c) smp_cond_load_acquire_timewait(v, c, 0, 1)
> -
>  #include <asm-generic/rqspinlock.h>
>
>  #endif /* _ASM_RQSPINLOCK_H */


---
AI reviewed your patch. Please fix the bug or reply by email explaining why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/20221060323