[PATCH v5 2/5] arm64: barrier: Add smp_cond_load_relaxed_timeout()

Ankur Arora posted 5 patches 3 weeks ago
Posted by Ankur Arora 3 weeks ago
Add smp_cond_load_relaxed_timeout(), a timed variant of
smp_cond_load_relaxed().

This uses __cmpwait_relaxed() to do the actual waiting, with the
event-stream guaranteeing that we wake up from WFE periodically
and not block forever in case there are no stores to the cacheline.

For cases when the event-stream is unavailable, fall back to
spin-waiting.

Cc: Will Deacon <will@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Haris Okanovic <harisokn@amazon.com>
Tested-by: Haris Okanovic <harisokn@amazon.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/include/asm/barrier.h | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index f5801b0ba9e9..4f0d9ed7a072 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -219,6 +219,29 @@ do {									\
 	(typeof(*ptr))VAL;						\
 })
 
+/* Re-declared here to avoid include dependency. */
+extern bool arch_timer_evtstrm_available(void);
+
+#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_check_expr)	\
+({									\
+	typeof(ptr) __PTR = (ptr);					\
+	__unqual_scalar_typeof(*ptr) VAL;				\
+	bool __wfe = arch_timer_evtstrm_available();			\
+									\
+	for (;;) {							\
+		VAL = READ_ONCE(*__PTR);				\
+		if (cond_expr)						\
+			break;						\
+		if (time_check_expr)					\
+			break;						\
+		if (likely(__wfe))					\
+			__cmpwait_relaxed(__PTR, VAL);			\
+		else							\
+			cpu_relax();					\
+	}								\
+	(typeof(*ptr)) VAL;						\
+})
+
 #include <asm-generic/barrier.h>
 
 #endif	/* __ASSEMBLY__ */
-- 
2.43.5
Re: [PATCH v5 2/5] arm64: barrier: Add smp_cond_load_relaxed_timeout()
Posted by Will Deacon 2 weeks ago
On Wed, Sep 10, 2025 at 08:46:52PM -0700, Ankur Arora wrote:
> Add smp_cond_load_relaxed_timeout(), a timed variant of
> smp_cond_load_relaxed().
> 
> This uses __cmpwait_relaxed() to do the actual waiting, with the
> event-stream guaranteeing that we wake up from WFE periodically
> and not block forever in case there are no stores to the cacheline.
> 
> For cases when the event-stream is unavailable, fallback to
> spin-waiting.
> 
> Cc: Will Deacon <will@kernel.org>
> Cc: linux-arm-kernel@lists.infradead.org
> Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> Reviewed-by: Haris Okanovic <harisokn@amazon.com>
> Tested-by: Haris Okanovic <harisokn@amazon.com>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>  arch/arm64/include/asm/barrier.h | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index f5801b0ba9e9..4f0d9ed7a072 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -219,6 +219,29 @@ do {									\
>  	(typeof(*ptr))VAL;						\
>  })
>  
> +/* Re-declared here to avoid include dependency. */
> +extern bool arch_timer_evtstrm_available(void);
> +
> +#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_check_expr)	\
> +({									\
> +	typeof(ptr) __PTR = (ptr);					\
> +	__unqual_scalar_typeof(*ptr) VAL;				\
> +	bool __wfe = arch_timer_evtstrm_available();			\
> +									\
> +	for (;;) {							\
> +		VAL = READ_ONCE(*__PTR);				\
> +		if (cond_expr)						\
> +			break;						\
> +		if (time_check_expr)					\
> +			break;						\
> +		if (likely(__wfe))					\
> +			__cmpwait_relaxed(__PTR, VAL);			\
> +		else							\
> +			cpu_relax();					\

It'd be an awful lot nicer if we could just use the generic code if
wfe isn't available. One option would be to make that available as
e.g. __smp_cond_load_relaxed_timeout_cpu_relax() and call it from the
arch code when !arch_timer_evtstrm_available() but a potentially cleaner
version would be to introduce something like cpu_poll_relax() and use
that in the core code.

So arm64 would do:

#define SMP_TIMEOUT_SPIN_COUNT	1
#define cpu_poll_relax(ptr, val)	do {				\
	if (arch_timer_evtstrm_available())				\
		__cmpwait_relaxed(ptr, val);				\
	else								\
		cpu_relax();						\
} while (0)

and then the core code would have:

#ifndef cpu_poll_relax
#define cpu_poll_relax(p, v)	cpu_relax()
#endif

and could just use cpu_poll_relax() in the generic implementation of
smp_cond_load_relaxed_timeout().

Will
Re: [PATCH v5 2/5] arm64: barrier: Add smp_cond_load_relaxed_timeout()
Posted by Catalin Marinas 1 week, 6 days ago
On Thu, Sep 18, 2025 at 09:05:22PM +0100, Will Deacon wrote:
> > +/* Re-declared here to avoid include dependency. */
> > +extern bool arch_timer_evtstrm_available(void);
> > +
> > +#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_check_expr)	\
> > +({									\
> > +	typeof(ptr) __PTR = (ptr);					\
> > +	__unqual_scalar_typeof(*ptr) VAL;				\
> > +	bool __wfe = arch_timer_evtstrm_available();			\
> > +									\
> > +	for (;;) {							\
> > +		VAL = READ_ONCE(*__PTR);				\
> > +		if (cond_expr)						\
> > +			break;						\
> > +		if (time_check_expr)					\
> > +			break;						\
> > +		if (likely(__wfe))					\
> > +			__cmpwait_relaxed(__PTR, VAL);			\
> > +		else							\
> > +			cpu_relax();					\
> 
> It'd be an awful lot nicer if we could just use the generic code if
> wfe isn't available. One option would be to make that available as
> e.g. __smp_cond_load_relaxed_timeout_cpu_relax() and call it from the
> arch code when !arch_timer_evtstrm_available() but a potentially cleaner
> version would be to introduce something like cpu_poll_relax() and use
> that in the core code.
> 
> So arm64 would do:
> 
> #define SMP_TIMEOUT_SPIN_COUNT	1
> #define cpu_poll_relax(ptr, val)	do {				\
> 	if (arch_timer_evtstrm_available())				\
> 		__cmpwait_relaxed(ptr, val);				\
> 	else								\
> 		cpu_relax();						\
> } while (0)
> 
> and then the core code would have:
> 
> #ifndef cpu_poll_relax
> #define cpu_poll_relax(p, v)	cpu_relax()
> #endif

A slight problem here is that we have two users that want different spin
counts: poll_idle() uses 200, rqspinlock wants 16K. They've been
empirically chosen but I guess it also depends on what they call in
time_check_expr and the resolution they need. From the discussion on
patch 5, Kumar would like to override the spin count to 16K from the
current one of 200 (or if poll_idle works with 16K, we just set that as
the default; we have yet to hear from the cpuidle folk).

I guess on arm64 we'd first #undef it and redefine it as 1. The
non-event-stream variant is really only for debug; I'd expect the event
stream to always be on in production (or to go for WFET).

So yeah, I think the above would work. Ankur proposed something similar
in the early versions but I found it too complicated (a spin and wait
policy callback populating the spin variable). Your proposal looks a lot
simpler.

Thanks.

-- 
Catalin
Re: [PATCH v5 2/5] arm64: barrier: Add smp_cond_load_relaxed_timeout()
Posted by Ankur Arora 1 week, 5 days ago
Catalin Marinas <catalin.marinas@arm.com> writes:

> On Thu, Sep 18, 2025 at 09:05:22PM +0100, Will Deacon wrote:
>> > +/* Re-declared here to avoid include dependency. */
>> > +extern bool arch_timer_evtstrm_available(void);
>> > +
>> > +#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_check_expr)	\
>> > +({									\
>> > +	typeof(ptr) __PTR = (ptr);					\
>> > +	__unqual_scalar_typeof(*ptr) VAL;				\
>> > +	bool __wfe = arch_timer_evtstrm_available();			\
>> > +									\
>> > +	for (;;) {							\
>> > +		VAL = READ_ONCE(*__PTR);				\
>> > +		if (cond_expr)						\
>> > +			break;						\
>> > +		if (time_check_expr)					\
>> > +			break;						\
>> > +		if (likely(__wfe))					\
>> > +			__cmpwait_relaxed(__PTR, VAL);			\
>> > +		else							\
>> > +			cpu_relax();					\
>>
>> It'd be an awful lot nicer if we could just use the generic code if
>> wfe isn't available. One option would be to make that available as
>> e.g. __smp_cond_load_relaxed_timeout_cpu_relax() and call it from the
>> arch code when !arch_timer_evtstrm_available() but a potentially cleaner
>> version would be to introduce something like cpu_poll_relax() and use
>> that in the core code.
>>
>> So arm64 would do:
>>
>> #define SMP_TIMEOUT_SPIN_COUNT	1
>> #define cpu_poll_relax(ptr, val)	do {				\
>> 	if (arch_timer_evtstrm_available())				\
>> 		__cmpwait_relaxed(ptr, val);				\
>> 	else								\
>> 		cpu_relax();						\
>> } while (0)
>>
>> and then the core code would have:
>>
>> #ifndef cpu_poll_relax
>> #define cpu_poll_relax(p, v)	cpu_relax()
>> #endif
>
> A slight problem here is that we have two users that want different spin
> counts: poll_idle() uses 200, rqspinlock wants 16K. They've been
> empirically chosen but I guess it also depends on what they call in
> time_check_expr and the resolution they need. From the discussion on
> patch 5, Kumar would like to override the spin count to 16K from the
> current one of 200 (or if poll_idle works with 16K, we just set that as
> the default; we have yet to hear from the cpuidle folk).
>
> I guess on arm64 we'd first #undef it and redefine it as 1.

I think you mean 16k? I have some (small) misgivings about that code
choosing the same value for all platforms but that could easily be
addressed if it becomes an issue at some point.

> The
> non-event stream variant is for debug only really, I'd expect it to
> always have it on in production (or go for WFET).


> So yeah, I think the above would work. Ankur proposed something similar
> in the early versions but I found it too complicated (a spin and wait
> policy callback populating the spin variable). Your proposal looks a lot
> simpler.

Yeah. This looks like a much simpler way of delegating the choice of
mechanism (polling, waiting, or some mixture) to the architecture,
without needing any separate policy etc.

--
ankur