[RESEND PATCH v7 1/7] asm-generic: barrier: Add smp_cond_load_relaxed_timeout()

Ankur Arora posted 7 patches 1 month, 3 weeks ago
Posted by Ankur Arora 1 month, 3 weeks ago
Add smp_cond_load_relaxed_timeout(), which extends
smp_cond_load_relaxed() to allow waiting for a duration.

The waiting loop uses cpu_poll_relax() to wait on the condition
variable with a periodic evaluation of a time-check.

cpu_poll_relax(), unless overridden by the arch code, amounts to
a cpu_relax().

The number of times we spin before evaluating the time-check is
defined by SMP_TIMEOUT_POLL_COUNT (200 by default) which, assuming
each cpu_poll_relax() iteration takes around 20-30 cycles (measured
on a variety of x86 platforms), amounts to ~4000-6000 cycles between
successive time-checks.
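
For illustration, a caller might use the primitive along the following
lines (a hypothetical sketch; 'flag', 'timeout_ns' and the
local_clock()-based deadline are illustrative names, not code from this
series):

	/* Wait for flag to become non-zero, giving up once local_clock()
	 * runs past a caller-chosen deadline. The time-check expression
	 * is only evaluated once every SMP_TIMEOUT_POLL_COUNT iterations.
	 */
	u64 deadline = local_clock() + timeout_ns;
	unsigned long val;

	val = smp_cond_load_relaxed_timeout(&flag, VAL != 0,
					    local_clock() >= deadline);

	/* val holds the last observed value of flag; re-check the
	 * condition to tell success apart from a timeout. */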

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-arch@vger.kernel.org
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Haris Okanovic <harisokn@amazon.com>
Tested-by: Haris Okanovic <harisokn@amazon.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/asm-generic/barrier.h | 41 +++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index d4f581c1e21d..0063b46ec065 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -273,6 +273,47 @@ do {									\
 })
 #endif
 
+#ifndef SMP_TIMEOUT_POLL_COUNT
+#define SMP_TIMEOUT_POLL_COUNT		200
+#endif
+
+#ifndef cpu_poll_relax
+#define cpu_poll_relax(ptr, val)	cpu_relax()
+#endif
+
+/**
+ * smp_cond_load_relaxed_timeout() - (Spin) wait for cond with no ordering
+ * guarantees until a timeout expires.
+ * @ptr: pointer to the variable to wait on
+ * @cond: boolean expression to wait for
+ * @time_check_expr: expression to decide when to bail out
+ *
+ * Equivalent to using READ_ONCE() on the condition variable.
+ */
+#ifndef smp_cond_load_relaxed_timeout
+#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_check_expr)	\
+({									\
+	typeof(ptr) __PTR = (ptr);					\
+	__unqual_scalar_typeof(*ptr) VAL;				\
+	u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT;			\
+									\
+	for (;;) {							\
+		VAL = READ_ONCE(*__PTR);				\
+		if (cond_expr)						\
+			break;						\
+		cpu_poll_relax(__PTR, VAL);				\
+		if (++__n < __spin)					\
+			continue;					\
+		if (time_check_expr) {					\
+			VAL = READ_ONCE(*__PTR);			\
+			break;						\
+		}							\
+		__n = 0;						\
+	}								\
+	(typeof(*ptr))VAL;						\
+})
+#endif
+
 /*
  * pmem_wmb() ensures that all stores for which the modification
  * are written to persistent storage by preceding instructions have
-- 
2.43.5
Re: [RESEND PATCH v7 1/7] asm-generic: barrier: Add smp_cond_load_relaxed_timeout()
Posted by Arnd Bergmann 1 month, 2 weeks ago
On Tue, Oct 28, 2025, at 06:31, Ankur Arora wrote:

> + */
> +#ifndef smp_cond_load_relaxed_timeout
> +#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_check_expr)	\
> +({									\
> +	typeof(ptr) __PTR = (ptr);					\
> +	__unqual_scalar_typeof(*ptr) VAL;				\
> +	u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT;			\
> +									\
> +	for (;;) {							\
> +		VAL = READ_ONCE(*__PTR);				\
> +		if (cond_expr)						\
> +			break;						\
> +		cpu_poll_relax(__PTR, VAL);				\
> +		if (++__n < __spin)					\
> +			continue;					\
> +		if (time_check_expr) {					\
> +			VAL = READ_ONCE(*__PTR);			\
> +			break;						\
> +		}							\
> +		__n = 0;						\
> +	}								\
> +	(typeof(*ptr))VAL;						\
> +})
> +#endif

I'm trying to think of ideas for how this would be done on arm64
with FEAT_WFxT in a way that doesn't hurt other architectures.

The best idea I've come up with is to change that inner loop
to combine the cpu_poll_relax() with the timecheck and then
define the 'time_check_expr' so it has to return an approximate
(ceiling) number of nanoseconds of remaining time or zero if
expired.

The FEAT_WFXT version would then look something like

static inline void __cmpwait_u64_timeout(volatile u64 *ptr, unsigned long val, __u64 ns)
{
   unsigned long tmp;
   asm volatile ("sev; wfe; ldxr; eor; cbnz; wfet; 1:"
        : "=&r" (tmp), "+Q" (*ptr)
        : "r" (val), "r" (ns));
}
#define cpu_poll_relax_timeout_wfet(__PTR, VAL, TIMECHECK) \
({                                                    \
       u64 __t = TIMECHECK;                           \
       if (__t)                                       \
            __cmpwait_u64_timeout(__PTR, VAL, __t);   \
})

while the 'wfe' version would continue to do the timecheck after the
wait.

I have two lesser concerns with the generic definition here:

- having both a timeout and a spin counter in the same loop
  feels redundant and error-prone, as the behavior in practice
  would likely depend a lot on the platform. What is the reason
  for keeping the counter if we already have a fixed timeout
  condition?

- I generally dislike the type-agnostic macros like this one,
  it adds a lot of extra complexity here that I feel can be
  completely avoided if we make explicitly 32-bit and 64-bit
  wide versions of these macros. We probably won't be able
  to resolve this as part of your series, but ideally I'd like to
  have explicitly-typed versions of cmpxchg(), smp_load_acquire()
  and all the related ones, the same way we do for atomic_*()
  and atomic64_*().

       Arnd
Re: [RESEND PATCH v7 1/7] asm-generic: barrier: Add smp_cond_load_relaxed_timeout()
Posted by Ankur Arora 1 month, 2 weeks ago
Arnd Bergmann <arnd@arndb.de> writes:

> On Tue, Oct 28, 2025, at 06:31, Ankur Arora wrote:
>
>> + */
>> +#ifndef smp_cond_load_relaxed_timeout
>> +#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_check_expr)	\
>> +({									\
>> +	typeof(ptr) __PTR = (ptr);					\
>> +	__unqual_scalar_typeof(*ptr) VAL;				\
>> +	u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT;			\
>> +									\
>> +	for (;;) {							\
>> +		VAL = READ_ONCE(*__PTR);				\
>> +		if (cond_expr)						\
>> +			break;						\
>> +		cpu_poll_relax(__PTR, VAL);				\
>> +		if (++__n < __spin)					\
>> +			continue;					\
>> +		if (time_check_expr) {					\
>> +			VAL = READ_ONCE(*__PTR);			\
>> +			break;						\
>> +		}							\
>> +		__n = 0;						\
>> +	}								\
>> +	(typeof(*ptr))VAL;						\
>> +})
>> +#endif
>
> I'm trying to think of ideas for how this would be done on arm64
> with FEAT_WFxT in a way that doesn't hurt other architectures.
>
> The best idea I've come up with is to change that inner loop
> to combine the cpu_poll_relax() with the timecheck and then
> define the 'time_check_expr' so it has to return an approximate
> (ceiling) number of nanoseconds of remaining time or zero if
> expired.

Agree that it's a pretty good idea :). I came up with something pretty
similar, though that took a bunch of iterations.

> The FEAT_WFXT version would then look something like
>
> static inline void __cmpwait_u64_timeout(volatile u64 *ptr, unsigned long val, __u64 ns)
> {
>    unsigned long tmp;
>    asm volatile ("sev; wfe; ldxr; eor; cbnz; wfet; 1:"
>         : "=&r" (tmp), "+Q" (*ptr)
>         : "r" (val), "r" (ns));
> }
> #define cpu_poll_relax_timeout_wfet(__PTR, VAL, TIMECHECK) \
> ({                                                    \
>        u64 __t = TIMECHECK;
>        if (__t)
>             __cmpwait_u64_timeout(__PTR, VAL, __t);
> })
>
> while the 'wfe' version would continue to do the timecheck after the
> wait.

I think this is a good way to do it if we need the precision
at some point in the future.

> I have two lesser concerns with the generic definition here:
>
> - having both a timeout and a spin counter in the same loop
>   feels redundant and error-prone, as the behavior in practice
>   would likely depend a lot on the platform. What is the reason
>   for keeping the counter if we already have a fixed timeout
>   condition?

The main reason was that the time check is expensive in power terms.
That's fine for platforms with a WFE-like primitive, but others want
to do the time check only infrequently. That's why poll_idle()
introduced a rate limit on polling (which the generic definition
here reuses.)

    commit 4dc2375c1a4e88ed2701f6961e0e4f9a7696ad3c
    Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Date:   Tue Mar 27 23:58:45 2018 +0200

    cpuidle: poll_state: Avoid invoking local_clock() too often

    Rik reports that he sees an increase in CPU use in one benchmark
    due to commit 612f1a22f067 "cpuidle: poll_state: Add time limit to
    poll_idle()" that caused poll_idle() to call local_clock() in every
    iteration of the loop.  Utilization increase generally means more
    non-idle time with respect to total CPU time (on the average) which
    implies reduced CPU frequency.

    Doug reports that limiting the rate of local_clock() invocations
    in there causes much less power to be drawn during a CPU-intensive
    parallel workload (with idle states 1 and 2 disabled to enforce more
    state 0 residency).

    These two reports together suggest that executing local_clock() on
    multiple CPUs in parallel at a high rate may cause chips to get hot
    and trigger thermal/power limits on them to kick in, so reduce the
    rate of local_clock() invocations in poll_idle() to avoid that issue.
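
For context, the rate-limited polling loop in poll_idle() that the
generic definition above mirrors looks roughly like this (simplified):

	u64 time_start = local_clock_noinstr();
	u64 limit = cpuidle_poll_time(drv, dev);
	unsigned int loop_count = 0;

	while (!need_resched()) {
		cpu_relax();
		if (loop_count++ < POLL_IDLE_RELAX_COUNT)
			continue;

		loop_count = 0;
		/* Only consult the (comparatively expensive) clock here. */
		if (local_clock_noinstr() - time_start > limit) {
			dev->poll_time_limit = true;
			break;
		}
	}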

> - I generally dislike the type-agnostic macros like this one,
>   it adds a lot of extra complexity here that I feel can be
>   completely avoided if we make explicitly 32-bit and 64-bit
>   wide versions of these macros. We probably won't be able
>   to resolve this as part of your series, but ideally I'd like to
>   have explicitly-typed versions of cmpxchg(), smp_load_acquire()
>   and all the related ones, the same way we do for atomic_*()
>   and atomic64_*().

Ah. And the caller uses, say, smp_load_acquire_long() or whatever, and
that resolves to whatever makes sense for the arch.

The __unqual_scalar_typeof() does look pretty ugly when looking at the
preprocessed version, but other than that smp_cond_load() etc. look
pretty straightforward. Just out of curiosity, could you elaborate on
the complexity?

--
ankur
Re: [RESEND PATCH v7 1/7] asm-generic: barrier: Add smp_cond_load_relaxed_timeout()
Posted by Arnd Bergmann 1 month, 2 weeks ago
On Wed, Oct 29, 2025, at 04:17, Ankur Arora wrote:
> Arnd Bergmann <arnd@arndb.de> writes:
>> On Tue, Oct 28, 2025, at 06:31, Ankur Arora wrote:
>> The FEAT_WFXT version would then look something like
>>
>> static inline void __cmpwait_u64_timeout(volatile u64 *ptr, unsigned long val, __u64 ns)
>> {
>>    unsigned long tmp;
>>    asm volatile ("sev; wfe; ldxr; eor; cbnz; wfet; 1:"
>>         : "=&r" (tmp), "+Q" (*ptr)
>>         : "r" (val), "r" (ns));
>> }
>> #define cpu_poll_relax_timeout_wfet(__PTR, VAL, TIMECHECK) \
>> ({                                                    \
>>        u64 __t = TIMECHECK;
>>        if (__t)
>>             __cmpwait_u64_timeout(__PTR, VAL, __t);
>> })
>>
>> while the 'wfe' version would continue to do the timecheck after the
>> wait.
>
> I think this is a good way to do it if we need the precision
> at some point in the future.

I'm sorry I wrote this in a confusing way, but I really don't
care about precision here. For power-savings reasons I would
like to use a wfe/wfet based wait (like you do) while at the
same time being able to turn off the event stream entirely as
long as FEAT_WFxT is available, saving additional power.

>> I have two lesser concerns with the generic definition here:
>>
>> - having both a timeout and a spin counter in the same loop
>>   feels redundant and error-prone, as the behavior in practice
>>   would likely depend a lot on the platform. What is the reason
>>   for keeping the counter if we already have a fixed timeout
>>   condition?
>
> The main reason was that the time check is expensive in power terms.
> Which is fine for platforms with a WFE like primitive but others
> want to do the time check only infrequently. That's why poll_idle()
> introduced a rate limit on polling (which the generic definition
> reused here.)

Right, I misunderstood how this works, so this part is fine.

>> - I generally dislike the type-agnostic macros like this one,
>>   it adds a lot of extra complexity here that I feel can be
>>   completely avoided if we make explicitly 32-bit and 64-bit
>>   wide versions of these macros. We probably won't be able
>>   to resolve this as part of your series, but ideally I'd like to
>>   have explicitly-typed versions of cmpxchg(), smp_load_acquire()
>>   and all the related ones, the same way we do for atomic_*()
>>   and atomic64_*().
>
> Ah. And the caller uses say smp_load_acquire_long() or whatever, and
> that resolves to whatever makes sense for the arch.
>
> The __unqual_scalar_typeof() does look pretty ugly when looking at the
> preprocessed version, but other than that smp_cond_load() etc. look
> pretty straightforward. Just out of curiosity, could you elaborate on
> the complexity?

The nested macros with __unqual_scalar_typeof() make the
preprocessed version completely unreadable, especially when
combined with other sets of complex macros like min()/max(),
tracepoints or arm64 atomics.

If something goes wrong inside of it, it becomes rather
hard for people to debug or even read the compiler warnings.
Since I do a lot of build testing, I do tend to run into those
more than others do.

I've also seen cases where excessively complex macros get
nested to the point of slowing down the kernel build, which
can happen once the nesting expands to megabytes of source
code.
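
For reference, __unqual_scalar_typeof() itself expands through a
_Generic() dispatch, roughly the following (from
include/linux/compiler_types.h), which is where a lot of that
preprocessed bulk comes from:

	#define __scalar_type_to_expr_cases(type)			\
			unsigned type:	(unsigned type)0,		\
			signed type:	(signed type)0

	#define __unqual_scalar_typeof(x) typeof(			\
			_Generic((x),					\
				 char:	(char)0,			\
				 __scalar_type_to_expr_cases(char),	\
				 __scalar_type_to_expr_cases(short),	\
				 __scalar_type_to_expr_cases(int),	\
				 __scalar_type_to_expr_cases(long),	\
				 __scalar_type_to_expr_cases(long long),\
				 default: (x)))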

     Arnd
Re: [RESEND PATCH v7 1/7] asm-generic: barrier: Add smp_cond_load_relaxed_timeout()
Posted by Ankur Arora 1 month, 2 weeks ago
Arnd Bergmann <arnd@arndb.de> writes:

> On Wed, Oct 29, 2025, at 04:17, Ankur Arora wrote:
>> Arnd Bergmann <arnd@arndb.de> writes:
>>> On Tue, Oct 28, 2025, at 06:31, Ankur Arora wrote:
>>> The FEAT_WFXT version would then look something like
>>>
>>> static inline void __cmpwait_u64_timeout(volatile u64 *ptr, unsigned long val, __u64 ns)
>>> {
>>>    unsigned long tmp;
>>>    asm volatile ("sev; wfe; ldxr; eor; cbnz; wfet; 1:"
>>>         : "=&r" (tmp), "+Q" (*ptr)
>>>         : "r" (val), "r" (ns));
>>> }
>>> #define cpu_poll_relax_timeout_wfet(__PTR, VAL, TIMECHECK) \
>>> ({                                                    \
>>>        u64 __t = TIMECHECK;
>>>        if (__t)
>>>             __cmpwait_u64_timeout(__PTR, VAL, __t);
>>> })
>>>
>>> while the 'wfe' version would continue to do the timecheck after the
>>> wait.
>>
>> I think this is a good way to do it if we need the precision
>> at some point in the future.
>
> I'm sorry I wrote this in a confusing way, but I really don't
> care about precision here, but for power savings reasons I
> would like to use a wfe/wfet based wait here (like you do)
> while at the same time being able to turn off the event
> stream entirely as long as FEAT_WFxT is available, saving
> additional power.
>
>>> I have two lesser concerns with the generic definition here:
>>>
>>> - having both a timeout and a spin counter in the same loop
>>>   feels redundant and error-prone, as the behavior in practice
>>>   would likely depend a lot on the platform. What is the reason
>>>   for keeping the counter if we already have a fixed timeout
>>>   condition?
>>
>> The main reason was that the time check is expensive in power terms.
>> Which is fine for platforms with a WFE like primitive but others
>> want to do the time check only infrequently. That's why poll_idle()
>> introduced a rate limit on polling (which the generic definition
>> reused here.)
>
> Right, I misunderstood how this works, so this part is fine.
>
>>> - I generally dislike the type-agnostic macros like this one,
>>>   it adds a lot of extra complexity here that I feel can be
>>>   completely avoided if we make explicitly 32-bit and 64-bit
>>>   wide versions of these macros. We probably won't be able
>>>   to resolve this as part of your series, but ideally I'd like to
>>>   have explicitly-typed versions of cmpxchg(), smp_load_acquire()
>>>   and all the related ones, the same way we do for atomic_*()
>>>   and atomic64_*().
>>
>> Ah. And the caller uses say smp_load_acquire_long() or whatever, and
>> that resolves to whatever makes sense for the arch.
>>
>> The __unqual_scalar_typeof() does look pretty ugly when looking at the
>> preprocessed version, but other than that smp_cond_load() etc. look
>> pretty straightforward. Just out of curiosity, could you elaborate on
>> the complexity?
>
> The nested macros with __unqual_scalar_typeof() make the
> preprocessed version completely unreadable, especially when
> combined with other sets of complex macros like min()/max(),
> tracepoints or arm64 atomics.
>
> If something goes wrong inside of it, it becomes rather
> hard for people to debug or even read the compiler warnings.
> Since I do a lot of build testing, I do tend to run into those
> more than others do.
>
> I've also seen cases where excessively complex macros get
> nested to the point of slowing down the kernel build, which
> can happen once the nesting expands to megabytes of source
> code.

Thanks for the detail. Yeah, looking at the preprocessed code (as I had
to do when working on this) it all really feels quite excessive.
Combined with other complex macro constructs ...

--
ankur