Add smp_cond_load_relaxed_timeout(), which extends
smp_cond_load_relaxed() to allow waiting for a duration.

The waiting loop uses cpu_poll_relax() to wait on the condition
variable, with a periodic evaluation of a time check.

Unless overridden by the arch code, cpu_poll_relax() amounts to
a cpu_relax().

The number of iterations between time checks is defined by
SMP_TIMEOUT_POLL_COUNT (chosen to be 200 by default) which, assuming
each cpu_poll_relax() iteration takes around 20-30 cycles (measured
on a variety of x86 platforms), amounts to ~4000-6000 cycles per
time check.
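
For illustration, a caller might use it as below (hypothetical
example, not part of this patch; 'flag' and the deadline are made up):

	/* Wait up to 100us for flag to become non-zero. */
	u64 deadline = local_clock() + 100 * NSEC_PER_USEC;
	unsigned long flag_val;

	flag_val = smp_cond_load_relaxed_timeout(&flag, VAL != 0,
						 local_clock() >= deadline);
	/* flag_val is the last observed value; on timeout the
	 * condition may still be false. */
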
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-arch@vger.kernel.org
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Haris Okanovic <harisokn@amazon.com>
Tested-by: Haris Okanovic <harisokn@amazon.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
include/asm-generic/barrier.h | 41 +++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index d4f581c1e21d..0063b46ec065 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -273,6 +273,47 @@ do { \
})
#endif
+#ifndef SMP_TIMEOUT_POLL_COUNT
+#define SMP_TIMEOUT_POLL_COUNT 200
+#endif
+
+#ifndef cpu_poll_relax
+#define cpu_poll_relax(ptr, val) cpu_relax()
+#endif
+
+/**
+ * smp_cond_load_relaxed_timeout() - (Spin) wait for cond with no ordering
+ * guarantees until a timeout expires.
+ * @ptr: pointer to the variable to wait on
+ * @cond: boolean expression to wait for
+ * @time_check_expr: expression to decide when to bail out
+ *
+ * Equivalent to using READ_ONCE() on the condition variable.
+ */
+#ifndef smp_cond_load_relaxed_timeout
+#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_check_expr) \
+({ \
+ typeof(ptr) __PTR = (ptr); \
+ __unqual_scalar_typeof(*ptr) VAL; \
+ u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT; \
+ \
+ for (;;) { \
+ VAL = READ_ONCE(*__PTR); \
+ if (cond_expr) \
+ break; \
+ cpu_poll_relax(__PTR, VAL); \
+ if (++__n < __spin) \
+ continue; \
+ if (time_check_expr) { \
+ VAL = READ_ONCE(*__PTR); \
+ break; \
+ } \
+ __n = 0; \
+ } \
+ (typeof(*ptr))VAL; \
+})
+#endif
+
/*
* pmem_wmb() ensures that all stores for which the modification
* are written to persistent storage by preceding instructions have
--
2.43.5
On Tue, Oct 28, 2025, at 06:31, Ankur Arora wrote:
> + */
> +#ifndef smp_cond_load_relaxed_timeout
> +#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_check_expr) \
> +({ \
> + typeof(ptr) __PTR = (ptr); \
> + __unqual_scalar_typeof(*ptr) VAL; \
> + u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT; \
> + \
> + for (;;) { \
> + VAL = READ_ONCE(*__PTR); \
> + if (cond_expr) \
> + break; \
> + cpu_poll_relax(__PTR, VAL); \
> + if (++__n < __spin) \
> + continue; \
> + if (time_check_expr) { \
> + VAL = READ_ONCE(*__PTR); \
> + break; \
> + } \
> + __n = 0; \
> + } \
> + (typeof(*ptr))VAL; \
> +})
> +#endif
I'm trying to think of ideas for how this would be done on arm64
with FEAT_WFxT in a way that doesn't hurt other architectures.
The best idea I've come up with is to change that inner loop
to combine the cpu_poll_relax() with the timecheck and then
define the 'time_check_expr' so it has to return an approximate
(ceiling) number of nanoseconds of remaining time or zero if
expired.
The FEAT_WFXT version would then look something like
static inline void __cmpwait_u64_timeout(volatile u64 *ptr, unsigned long val, __u64 ns)
{
unsigned long tmp;
asm volatile ("sev; wfe; ldxr; eor; cbnz; wfet; 1:"
: "=&r" (tmp), "+Q" (*ptr)
: "r" (val), "r" (ns));
}
#define cpu_poll_relax_timeout_wfet(__PTR, VAL, TIMECHECK) \
({ \
	u64 __t = TIMECHECK;					\
	if (__t)						\
		__cmpwait_u64_timeout(__PTR, VAL, __t);		\
})
while the 'wfe' version would continue to do the timecheck after the
wait.
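Concretely, that 'wfe' version might be as simple as the following
(hypothetical sketch; __cmpwait_relaxed() is the existing arm64
helper, the macro name is made up):

#define cpu_poll_relax_timeout_wfe(__PTR, VAL, TIMECHECK) \
	__cmpwait_relaxed(__PTR, VAL)

since the event stream already bounds how long the wait can last,
the TIMECHECK result only needs to be consulted by the caller's
loop afterwards.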
I have two lesser concerns with the generic definition here:
- having both a timeout and a spin counter in the same loop
feels redundant and error-prone, as the behavior in practice
would likely depend a lot on the platform. What is the reason
for keeping the counter if we already have a fixed timeout
condition?
- I generally dislike the type-agnostic macros like this one,
it adds a lot of extra complexity here that I feel can be
completely avoided if we make explicitly 32-bit and 64-bit
wide versions of these macros. We probably won't be able
to resolve this as part of your series, but ideally I'd like to
have explicitly-typed versions of cmpxchg(), smp_load_acquire()
and all the related ones, the same way we do for atomic_*()
and atomic64_*().
Arnd
Arnd Bergmann <arnd@arndb.de> writes:
> On Tue, Oct 28, 2025, at 06:31, Ankur Arora wrote:
>
>> + */
>> +#ifndef smp_cond_load_relaxed_timeout
>> +#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_check_expr) \
>> +({ \
>> + typeof(ptr) __PTR = (ptr); \
>> + __unqual_scalar_typeof(*ptr) VAL; \
>> + u32 __n = 0, __spin = SMP_TIMEOUT_POLL_COUNT; \
>> + \
>> + for (;;) { \
>> + VAL = READ_ONCE(*__PTR); \
>> + if (cond_expr) \
>> + break; \
>> + cpu_poll_relax(__PTR, VAL); \
>> + if (++__n < __spin) \
>> + continue; \
>> + if (time_check_expr) { \
>> + VAL = READ_ONCE(*__PTR); \
>> + break; \
>> + } \
>> + __n = 0; \
>> + } \
>> + (typeof(*ptr))VAL; \
>> +})
>> +#endif
>
> I'm trying to think of ideas for how this would be done on arm64
> with FEAT_WFxT in a way that doesn't hurt other architectures.
>
> The best idea I've come up with is to change that inner loop
> to combine the cpu_poll_relax() with the timecheck and then
> define the 'time_check_expr' so it has to return an approximate
> (ceiling) number of nanoseconds of remaining time or zero if
> expired.
Agree that it's a pretty good idea :). I came up with something pretty
similar, though it took a bunch of iterations.
> The FEAT_WFXT version would then look something like
>
> static inline void __cmpwait_u64_timeout(volatile u64 *ptr, unsigned long val, __u64 ns)
> {
> unsigned long tmp;
> asm volatile ("sev; wfe; ldxr; eor; cbnz; wfet; 1:"
> : "=&r" (tmp), "+Q" (*ptr)
> : "r" (val), "r" (ns));
> }
> #define cpu_poll_relax_timeout_wfet(__PTR, VAL, TIMECHECK) \
> ({ \
> u64 __t = TIMECHECK;
> if (__t)
> __cmpwait_u64_timeout(__PTR, VAL, __t);
> })
>
> while the 'wfe' version would continue to do the timecheck after the
> wait.
I think this is a good way to do it if we need the precision
at some point in the future.
> I have two lesser concerns with the generic definition here:
>
> - having both a timeout and a spin counter in the same loop
> feels redundant and error-prone, as the behavior in practice
> would likely depend a lot on the platform. What is the reason
> for keeping the counter if we already have a fixed timeout
> condition?
The main reason was that the time check is expensive in power terms.
That's fine for platforms with a WFE-like primitive, but others want
to do the time check only infrequently. That's why poll_idle()
introduced a rate limit on polling (which the generic definition
reuses here):
commit 4dc2375c1a4e88ed2701f6961e0e4f9a7696ad3c
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date: Tue Mar 27 23:58:45 2018 +0200
cpuidle: poll_state: Avoid invoking local_clock() too often
Rik reports that he sees an increase in CPU use in one benchmark
due to commit 612f1a22f067 "cpuidle: poll_state: Add time limit to
poll_idle()" that caused poll_idle() to call local_clock() in every
iteration of the loop. Utilization increase generally means more
non-idle time with respect to total CPU time (on the average) which
implies reduced CPU frequency.
Doug reports that limiting the rate of local_clock() invocations
in there causes much less power to be drawn during a CPU-intensive
parallel workload (with idle states 1 and 2 disabled to enforce more
state 0 residency).
These two reports together suggest that executing local_clock() on
multiple CPUs in parallel at a high rate may cause chips to get hot
and trigger thermal/power limits on them to kick in, so reduce the
rate of local_clock() invocations in poll_idle() to avoid that issue.
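For reference, the rate-limited loop in poll_idle() looks roughly like
this (paraphrased sketch, not the exact upstream code; POLL_RELAX_COUNT
is an illustrative name):

	#define POLL_RELAX_COUNT	200

	u64 time_start = local_clock();
	unsigned int loop_count = 0;

	while (!need_resched()) {
		cpu_relax();
		if (loop_count++ < POLL_RELAX_COUNT)
			continue;	/* skip the expensive clock read */

		loop_count = 0;
		if (local_clock() - time_start > limit)
			break;		/* time limit reached */
	}

where 'limit' would come from cpuidle_poll_time().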
> - I generally dislike the type-agnostic macros like this one,
> it adds a lot of extra complexity here that I feel can be
> completely avoided if we make explicitly 32-bit and 64-bit
> wide versions of these macros. We probably won't be able
> to resolve this as part of your series, but ideally I'd like to
> have explicitly-typed versions of cmpxchg(), smp_load_acquire()
> and all the related ones, the same way we do for atomic_*()
> and atomic64_*().
Ah. And the caller uses say smp_load_acquire_long() or whatever, and
that resolves to whatever makes sense for the arch.
The __unqual_scalar_typeof() does look pretty ugly when looking at the
preprocessed version, but other than that smp_cond_load() etc. look
pretty straightforward. Just out of curiosity, could you elaborate on
the complexity?
--
ankur
On Wed, Oct 29, 2025, at 04:17, Ankur Arora wrote:
> Arnd Bergmann <arnd@arndb.de> writes:
>> On Tue, Oct 28, 2025, at 06:31, Ankur Arora wrote:
>> The FEAT_WFXT version would then look something like
>>
>> static inline void __cmpwait_u64_timeout(volatile u64 *ptr, unsigned long val, __u64 ns)
>> {
>> unsigned long tmp;
>> asm volatile ("sev; wfe; ldxr; eor; cbnz; wfet; 1:"
>> : "=&r" (tmp), "+Q" (*ptr)
>> : "r" (val), "r" (ns));
>> }
>> #define cpu_poll_relax_timeout_wfet(__PTR, VAL, TIMECHECK) \
>> ({ \
>> u64 __t = TIMECHECK;
>> if (__t)
>> __cmpwait_u64_timeout(__PTR, VAL, __t);
>> })
>>
>> while the 'wfe' version would continue to do the timecheck after the
>> wait.
>
> I think this is a good way to do it if we need the precision
> at some point in the future.
I'm sorry I wrote this in a confusing way; I really don't
care about precision here. For power savings reasons I
would like to use a wfe/wfet based wait here (like you do),
while at the same time being able to turn off the event
stream entirely as long as FEAT_WFxT is available, saving
additional power.
>> I have two lesser concerns with the generic definition here:
>>
>> - having both a timeout and a spin counter in the same loop
>> feels redundant and error-prone, as the behavior in practice
>> would likely depend a lot on the platform. What is the reason
>> for keeping the counter if we already have a fixed timeout
>> condition?
>
> The main reason was that the time check is expensive in power terms.
> That's fine for platforms with a WFE-like primitive, but others want
> to do the time check only infrequently. That's why poll_idle()
> introduced a rate limit on polling (which the generic definition
> reuses here).
Right, I misunderstood how this works, so this part is fine.
>> - I generally dislike the type-agnostic macros like this one,
>> it adds a lot of extra complexity here that I feel can be
>> completely avoided if we make explicitly 32-bit and 64-bit
>> wide versions of these macros. We probably won't be able
>> to resolve this as part of your series, but ideally I'd like to
>> have explicitly-typed versions of cmpxchg(), smp_load_acquire()
>> and all the related ones, the same way we do for atomic_*()
>> and atomic64_*().
>
> Ah. And the caller uses say smp_load_acquire_long() or whatever, and
> that resolves to whatever makes sense for the arch.
>
> The __unqual_scalar_typeof() does look pretty ugly when looking at the
> preprocessed version, but other than that smp_cond_load() etc. look
> pretty straightforward. Just out of curiosity, could you elaborate on
> the complexity?
The nested macros with __unqual_scalar_typeof() make the
preprocessed version completely unreadable, especially when
combined with other sets of complex macros like min()/max(),
tracepoints or arm64 atomics.
If something goes wrong inside of it, it becomes rather
hard for people to debug or even read the compiler warnings.
Since I do a lot of build testing, I do tend to run into those
more than others do.
I've also seen cases where excessively complex macros get
nested to the point of slowing down the kernel build, which
can happen once the nesting expands to megabytes of source
code.
Arnd
Arnd Bergmann <arnd@arndb.de> writes:
> On Wed, Oct 29, 2025, at 04:17, Ankur Arora wrote:
>> Arnd Bergmann <arnd@arndb.de> writes:
>>> On Tue, Oct 28, 2025, at 06:31, Ankur Arora wrote:
>>> The FEAT_WFXT version would then look something like
>>>
>>> static inline void __cmpwait_u64_timeout(volatile u64 *ptr, unsigned long val, __u64 ns)
>>> {
>>> unsigned long tmp;
>>> asm volatile ("sev; wfe; ldxr; eor; cbnz; wfet; 1:"
>>> : "=&r" (tmp), "+Q" (*ptr)
>>> : "r" (val), "r" (ns));
>>> }
>>> #define cpu_poll_relax_timeout_wfet(__PTR, VAL, TIMECHECK) \
>>> ({ \
>>> u64 __t = TIMECHECK;
>>> if (__t)
>>> __cmpwait_u64_timeout(__PTR, VAL, __t);
>>> })
>>>
>>> while the 'wfe' version would continue to do the timecheck after the
>>> wait.
>>
>> I think this is a good way to do it if we need the precision
>> at some point in the future.
>
> I'm sorry I wrote this in a confusing way; I really don't
> care about precision here. For power savings reasons I
> would like to use a wfe/wfet based wait here (like you do),
> while at the same time being able to turn off the event
> stream entirely as long as FEAT_WFxT is available, saving
> additional power.
>
>>> I have two lesser concerns with the generic definition here:
>>>
>>> - having both a timeout and a spin counter in the same loop
>>> feels redundant and error-prone, as the behavior in practice
>>> would likely depend a lot on the platform. What is the reason
>>> for keeping the counter if we already have a fixed timeout
>>> condition?
>>
>> The main reason was that the time check is expensive in power terms.
>> That's fine for platforms with a WFE-like primitive, but others want
>> to do the time check only infrequently. That's why poll_idle()
>> introduced a rate limit on polling (which the generic definition
>> reuses here).
>
> Right, I misunderstood how this works, so this part is fine.
>
>>> - I generally dislike the type-agnostic macros like this one,
>>> it adds a lot of extra complexity here that I feel can be
>>> completely avoided if we make explicitly 32-bit and 64-bit
>>> wide versions of these macros. We probably won't be able
>>> to resolve this as part of your series, but ideally I'd like to
>>> have explicitly-typed versions of cmpxchg(), smp_load_acquire()
>>> and all the related ones, the same way we do for atomic_*()
>>> and atomic64_*().
>>
>> Ah. And the caller uses say smp_load_acquire_long() or whatever, and
>> that resolves to whatever makes sense for the arch.
>>
>> The __unqual_scalar_typeof() does look pretty ugly when looking at the
>> preprocessed version, but other than that smp_cond_load() etc. look
>> pretty straightforward. Just out of curiosity, could you elaborate on
>> the complexity?
>
> The nested macros with __unqual_scalar_typeof() make the
> preprocessed version completely unreadable, especially when
> combined with other sets of complex macros like min()/max(),
> tracepoints or arm64 atomics.
>
> If something goes wrong inside of it, it becomes rather
> hard for people to debug or even read the compiler warnings.
> Since I do a lot of build testing, I do tend to run into those
> more than others do.
>
> I've also seen cases where excessively complex macros get
> nested to the point of slowing down the kernel build, which
> can happen once the nesting expands to megabytes of source
> code.
Thanks for the detail. Yeah, looking at the preprocessed code (as I had
to do when working on this) it all really feels quite excessive,
especially when combined with other complex macro constructs ...
--
ankur