Add smp_cond_load_relaxed_timewait(), which extends the non-timeout
variant for cases where we don't want to wait indefinitely.
The interface adds parameters for a timeout check and a policy that
decides exactly how to wait for the condition to change.
The waiting is done via the usual cpu_relax() spin-wait around the
condition variable, with periodic evaluation of the time-check
expression, and, optionally, via architectural primitives that allow
for cheaper mechanisms such as waiting on stores to a memory address
with an out-of-band timeout.
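For example, a caller that wants to wait on a flag, but only until a
deadline, might do (illustrative only; __smp_cond_timewait_coarse() is
one of the generic wait policies added later in this series, and the
time expressions are assumed to be denominated in ns):

	u64 end = ktime_get_mono_fast_ns() + timeout_ns;

	/* Wait for the flag to become non-zero, or for 'end' to pass. */
	val = smp_cond_load_relaxed_timewait(&flag, VAL != 0,
					     __smp_cond_timewait_coarse,
					     ktime_get_mono_fast_ns(), end);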
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
include/asm-generic/barrier.h | 58 +++++++++++++++++++++++++++++++++++
1 file changed, 58 insertions(+)
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index d4f581c1e21d..a7be98e906f4 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -273,6 +273,64 @@ do { \
})
#endif

+/*
+ * Non-spin primitive that allows waiting for stores to an address,
+ * with support for a timeout. This works in conjunction with an
+ * architecturally defined wait_policy.
+ */
+#ifndef __smp_timewait_store
+#define __smp_timewait_store(ptr, val) do { } while (0)
+#endif
+
+#ifndef __smp_cond_load_relaxed_timewait
+#define __smp_cond_load_relaxed_timewait(ptr, cond_expr, wait_policy, \
+ time_expr, time_end) ({ \
+ typeof(ptr) __PTR = (ptr); \
+ __unqual_scalar_typeof(*ptr) VAL; \
+ u32 __n = 0, __spin = 0; \
+ u64 __prev = 0, __end = (time_end); \
+ bool __wait = false; \
+ \
+ for (;;) { \
+ VAL = READ_ONCE(*__PTR); \
+ if (cond_expr) \
+ break; \
+ cpu_relax(); \
+ if (++__n < __spin) \
+ continue; \
+ if (!(__prev = wait_policy((time_expr), __prev, __end, \
+ &__spin, &__wait))) \
+ break; \
+ if (__wait) \
+ __smp_timewait_store(__PTR, VAL); \
+ __n = 0; \
+ } \
+ (typeof(*ptr))VAL; \
+})
+#endif
+
+/**
+ * smp_cond_load_relaxed_timewait() - (Spin) wait for cond with no ordering
+ * guarantees until a timeout expires.
+ * @ptr: pointer to the variable to wait on
+ * @cond_expr: boolean expression to wait for
+ * @wait_policy: policy handler that adjusts the number of times we spin, or
+ * wait for the cacheline to change (arch dependent; not supported in
+ * generic code), before evaluating @time_expr
+ * @time_expr: monotonic expression that evaluates to the current time
+ * @time_end: end of the wait period, compared against @time_expr
+ *
+ * Equivalent to using READ_ONCE() on the condition variable.
+ */
+#define smp_cond_load_relaxed_timewait(ptr, cond_expr, wait_policy, \
+ time_expr, time_end) ({ \
+ __unqual_scalar_typeof(*ptr) _val; \
+ _val = __smp_cond_load_relaxed_timewait(ptr, cond_expr, \
+ wait_policy, time_expr, \
+ time_end); \
+ (typeof(*ptr))_val; \
+})
+
/*
* pmem_wmb() ensures that all stores for which the modification
* are written to persistent storage by preceding instructions have
--
2.43.5
Hi Ankur,
Sorry, it took me some time to get back to this series (well, I tried
once and got stuck on what wait_policy is supposed to mean, so decided
to wait until I had more coffee ;)).
On Fri, May 02, 2025 at 01:52:17AM -0700, Ankur Arora wrote:
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index d4f581c1e21d..a7be98e906f4 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -273,6 +273,64 @@ do { \
> })
> #endif
>
> +/*
> + * Non-spin primitive that allows waiting for stores to an address,
> + * with support for a timeout. This works in conjunction with an
> + * architecturally defined wait_policy.
> + */
> +#ifndef __smp_timewait_store
> +#define __smp_timewait_store(ptr, val) do { } while (0)
> +#endif
> +
> +#ifndef __smp_cond_load_relaxed_timewait
> +#define __smp_cond_load_relaxed_timewait(ptr, cond_expr, wait_policy, \
> + time_expr, time_end) ({ \
> + typeof(ptr) __PTR = (ptr); \
> + __unqual_scalar_typeof(*ptr) VAL; \
> + u32 __n = 0, __spin = 0; \
> + u64 __prev = 0, __end = (time_end); \
> + bool __wait = false; \
> + \
> + for (;;) { \
> + VAL = READ_ONCE(*__PTR); \
> + if (cond_expr) \
> + break; \
> + cpu_relax(); \
> + if (++__n < __spin) \
> + continue; \
> + if (!(__prev = wait_policy((time_expr), __prev, __end, \
> + &__spin, &__wait))) \
> + break; \
> + if (__wait) \
> + __smp_timewait_store(__PTR, VAL); \
> + __n = 0; \
> + } \
> + (typeof(*ptr))VAL; \
> +})
> +#endif
> +
> +/**
> + * smp_cond_load_relaxed_timewait() - (Spin) wait for cond with no ordering
> + * guarantees until a timeout expires.
> + * @ptr: pointer to the variable to wait on
> + * @cond_expr: boolean expression to wait for
> + * @wait_policy: policy handler that adjusts the number of times we spin, or
> + * wait for the cacheline to change (arch dependent; not supported in
> + * generic code), before evaluating @time_expr
> + * @time_expr: monotonic expression that evaluates to the current time
> + * @time_end: end of the wait period, compared against @time_expr
> + *
> + * Equivalent to using READ_ONCE() on the condition variable.
> + */
> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr, wait_policy, \
> + time_expr, time_end) ({ \
> + __unqual_scalar_typeof(*ptr) _val; \
> + _val = __smp_cond_load_relaxed_timewait(ptr, cond_expr, \
> + wait_policy, time_expr, \
> + time_end); \
> + (typeof(*ptr))_val; \
> +})
IIUC, a generic user of this interface would need a wait_policy() that
is aware of the arch details (event stream, WFET etc.), given the
__smp_timewait_store() implementation in patch 3. This becomes clearer
in patch 7 where one needs to create rqspinlock_cond_timewait().
The __spin count can be arch-specific, not part of some wait_policy,
even if such a policy is most likely implemented in the arch code (as
the generic caller has no clue what it means). The __wait decision,
again, shouldn't be something the caller of this API has to handle;
it's internal to the API implementation, based on whether the event
stream (or, later, WFET) is available.
The ___cond_timewait() implementation in patch 4 sets __wait if either
the event stream or WFET is available. However, __smp_timewait_store()
only uses WFE, as per the __cmpwait_relaxed() implementation. So you
can't really decouple wait_policy() from how the spinning is done in an
arch-specific way. In this implementation, wait_policy() would need to
say how to wait - WFE or WFET. That's not captured (and I don't think
it should be; we can't expand the API every time we have a new method
of waiting).
I still think this interface can be simpler and fairly generic, not with
wait_policy specific to rqspinlock or poll_idle. Maybe you can keep a
policy argument for an internal __smp_cond_load_relaxed_timewait() if
it's easier to structure the code this way but definitely not for
smp_cond_*().
Another aspect I'm not keen on is the arbitrary fine/coarse constants.
Can we not have the caller pass a slack value (in ns or 0 if it doesn't
care) to smp_cond_load_relaxed_timewait() and let the arch code decide
which policy to use?
In summary, I see the API something like:
#define smp_cond_load_relaxed_timewait(ptr, cond_expr,
time_expr, time_end, slack_ns)
We can even drop time_end if we capture it in time_expr returning a bool
(like we do with cond_expr).
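A caller would then look something like this (rough sketch; the time
accessor and units are just for illustration):

	u64 end = ktime_get_mono_fast_ns() + timeout_ns;

	/* Spin/wait until the flag is set or 'end' passes; no slack. */
	val = smp_cond_load_relaxed_timewait(&flag, VAL != 0,
					     ktime_get_mono_fast_ns(),
					     end, 0);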
Thanks.
--
Catalin
Catalin Marinas <catalin.marinas@arm.com> writes:
> Hi Ankur,
>
> Sorry, it took me some time to get back to this series (well, I tried
> once and got stuck on what wait_policy is supposed to mean, so decided
> to wait until I had more coffee ;)).
I suppose that's as good a sign as any that the wait_policy stuff needs
to change ;).
> On Fri, May 02, 2025 at 01:52:17AM -0700, Ankur Arora wrote:
>> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
>> index d4f581c1e21d..a7be98e906f4 100644
>> --- a/include/asm-generic/barrier.h
>> +++ b/include/asm-generic/barrier.h
>> @@ -273,6 +273,64 @@ do { \
>> })
>> #endif
>>
>> +/*
>> + * Non-spin primitive that allows waiting for stores to an address,
>> + * with support for a timeout. This works in conjunction with an
>> + * architecturally defined wait_policy.
>> + */
>> +#ifndef __smp_timewait_store
>> +#define __smp_timewait_store(ptr, val) do { } while (0)
>> +#endif
>> +
>> +#ifndef __smp_cond_load_relaxed_timewait
>> +#define __smp_cond_load_relaxed_timewait(ptr, cond_expr, wait_policy, \
>> + time_expr, time_end) ({ \
>> + typeof(ptr) __PTR = (ptr); \
>> + __unqual_scalar_typeof(*ptr) VAL; \
>> + u32 __n = 0, __spin = 0; \
>> + u64 __prev = 0, __end = (time_end); \
>> + bool __wait = false; \
>> + \
>> + for (;;) { \
>> + VAL = READ_ONCE(*__PTR); \
>> + if (cond_expr) \
>> + break; \
>> + cpu_relax(); \
>> + if (++__n < __spin) \
>> + continue; \
>> + if (!(__prev = wait_policy((time_expr), __prev, __end, \
>> + &__spin, &__wait))) \
>> + break; \
>> + if (__wait) \
>> + __smp_timewait_store(__PTR, VAL); \
>> + __n = 0; \
>> + } \
>> + (typeof(*ptr))VAL; \
>> +})
>> +#endif
>> +
>> +/**
>> + * smp_cond_load_relaxed_timewait() - (Spin) wait for cond with no ordering
>> + * guarantees until a timeout expires.
>> + * @ptr: pointer to the variable to wait on
>> + * @cond_expr: boolean expression to wait for
>> + * @wait_policy: policy handler that adjusts the number of times we spin, or
>> + * wait for the cacheline to change (arch dependent; not supported in
>> + * generic code), before evaluating @time_expr
>> + * @time_expr: monotonic expression that evaluates to the current time
>> + * @time_end: end of the wait period, compared against @time_expr
>> + *
>> + * Equivalent to using READ_ONCE() on the condition variable.
>> + */
>> +#define smp_cond_load_relaxed_timewait(ptr, cond_expr, wait_policy, \
>> + time_expr, time_end) ({ \
>> + __unqual_scalar_typeof(*ptr) _val; \
>> + _val = __smp_cond_load_relaxed_timewait(ptr, cond_expr, \
>> + wait_policy, time_expr, \
>> + time_end); \
>> + (typeof(*ptr))_val; \
>> +})
>
> IIUC, a generic user of this interface would need a wait_policy() that
> is aware of the arch details (event stream, WFET etc.), given the
> __smp_timewait_store() implementation in patch 3. This becomes clearer
> in patch 7 where one needs to create rqspinlock_cond_timewait().
Yes, if a caller can't work with __smp_cond_timewait_coarse() etc.,
they would need to know the mechanics of how to do that on each arch.
I meant for the two policies to be somewhat generic, but having to know
the internals is a problem.
> The __spin count can be arch specific, not part of some wait_policy,
> even if such policy is most likely implemented in the arch code (as the
> generic caller has no clue what it means). The __wait decision, again, I
> don't think it should be the caller of this API to decide how to handle,
> it's something internal to the API implementation based on whether the
> event stream (or later WFET) is available.
>
> The ___cond_timewait() implementation in patch 4 sets __wait if either
> the event stream of WFET is available. However, __smp_timewait_store()
> only uses WFE as per the __cmpwait_relaxed() implementation. So you
> can't really decouple wait_policy() from how the spinning is done, in an
> arch-specific way.
Agreed.
> In this implementation, wait_policy() would need to
> say how to wait - WFE, WFET. That's not captured (and I don't think it
> should, we can't expand the API every time we have a new method of
> waiting).
The idea was that the wait_policy and the arch-specific interface would
evolve together, so once __cmpwait_relaxed() supports WFET, the
wait_policy would change alongside it.
However, as you say, for users that define their own wait_policy, the
interface becomes a mess to maintain.
> I still think this interface can be simpler and fairly generic, not with
> wait_policy specific to rqspinlock or poll_idle. Maybe you can keep a
> policy argument for an internal __smp_cond_load_relaxed_timewait() if
> it's easier to structure the code this way but definitely not for
> smp_cond_*().
Yeah, I think that's probably the way to do this. The main reason I
felt we needed an explicit wait_policy was to address the rqspinlock
case, but as you point out, that makes the interface unmaintainable.
So, this should work for most users (see below for one proviso):
#define smp_cond_load_relaxed_timewait(ptr, cond_expr,
time_expr, time_end, slack_us)
(Though, I would use slack_us instead of slack_ns and also keep time_expr
and time_end denominated in us.)
And users like rqspinlock could use __smp_cond_load_relaxed_timewait()
with a policy argument, combining the rqspinlock-specific policy with
the common wait policy, so they wouldn't need to know the internals of
the waiting mechanisms.
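Something along these lines, perhaps (only a sketch; the signature
mirrors how wait_policy is invoked in the generic loop, with
___cond_timewait() being the common policy from patch 4):

	static inline u64 rqspinlock_cond_timewait(u64 now, u64 prev,
						   u64 end, u32 *spin,
						   bool *wait)
	{
		/*
		 * rqspinlock specific checks (deadlock detection etc.)
		 * would go here, returning 0 to abort the wait early.
		 */
		return ___cond_timewait(now, prev, end, spin, wait);
	}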
> Another aspect I'm not keen on is the arbitrary fine/coarse constants.
> Can we not have the caller pass a slack value (in ns or 0 if it doesn't
> care) to smp_cond_load_relaxed_timewait() and let the arch code decide
> which policy to use?
Yeah, as you probably noticed, that's pretty much how they are
implemented internally already.
> In summary, I see the API something like:
>
> #define smp_cond_load_relaxed_timewait(ptr, cond_expr,
> time_expr, time_end, slack_ns)
Ack.
> We can even drop time_end if we capture it in time_expr returning a bool
> (like we do with cond_expr).
I'm not sure we can combine time_expr and time_end. Given that we have
two ways to wait (spin and wait), each with a different granularity,
just a binary check won't suffice.
For switching between wait and spin, we would also need to compare the
granularity of the mechanism, derive the time remaining, check against
the slack, etc.
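Internally the policy would need something like this (names
illustrative):

	u64 remaining = end - now;

	/*
	 * Use the waiting primitive only if its wakeup granularity fits
	 * within the remaining time, allowing for the caller's slack;
	 * otherwise fall back to cpu_relax() spinning.
	 */
	*wait = wait_available && (remaining > wait_granularity + slack);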
Thanks for the comments. Most helpful.
--
ankur