Hi,
This series adds waited variants of the smp_cond_load() primitives:
smp_cond_load_relaxed_timewait(), and smp_cond_load_acquire_timewait().
There are two known users for these interfaces:
- poll_idle() [1]
- resilient queued spinlocks [2]
For both of these cases we want to wait on a condition but also want
to terminate the wait based on a timeout.
Before describing how v2 implements these interfaces, let me recap the
problems in v1 (Catalin outlined most of these in [3]):
smp_cond_load_relaxed_spinwait(ptr, cond_expr, time_expr_ns, time_limit_ns)
took four arguments, with ptr and cond_expr doing the usual smp_cond_load()
things and time_expr_ns and time_limit_ns being used to decide the
terminating condition.
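For reference, a v1-style caller (a poll_idle()-like user) would look
roughly like this (an illustrative sketch, not the exact patch;
time_limit_ns here is the caller's deadline in ns):

    flags = smp_cond_load_relaxed_spinwait(&current_thread_info()->flags,
                                           VAL & _TIF_NEED_RESCHED,
                                           local_clock(), time_limit_ns);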
There were some problems in the timekeeping:
1. How often do we do the (relatively expensive) time-check?
The choice made was once every 200 spin-wait iterations, with each
iteration trying to idle the pipeline by executing cpu_relax().
The choice of 200 was, of course, arbitrary and somewhat meaningless
across architectures. On recent x86, cpu_relax()/PAUSE takes ~20-30
cycles, but on (non-SMT) arm64 cpu_relax()/YIELD is effectively
just a NOP.
Even if each architecture had its own limit, this would also vary
across microarchitectures.
2. On arm64, which can do better than just cpu_relax(), for instance
by waiting for a store to an address (WFE), the implementation
used WFE exclusively, with the spin-wait used only as a fallback
when the event stream was disabled.
One problem with this was that the worst-case timeout overshoot
with WFE is ARCH_TIMER_EVT_STREAM_PERIOD_US (100us), so there's
a vast gulf between that and the potentially much smaller granularity
of the spin-wait versions. In addition, the interface provided
no way for the caller to specify or limit the overshoot.
Non-timekeeping issues:
3. The interface was useful for poll_idle()-like users but was not
usable if the caller needed to do any work. For instance,
rqspinlock uses it thus:
smp_cond_load_acquire_timewait(v, c, 0, 1)
Here the time-check always evaluates to false and all of the logic
(ex. deadlock checking) is folded into the conditional.
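Concretely, that usage amounts to something like this sketch (simplified;
check_deadlock() stands in for rqspinlock's actual timeout/deadlock helper):

    /*
     * Sketch only: comparing a time_expr of 0 against a time_limit of 1
     * means the time-check never terminates the wait; the deadlock and
     * timeout logic has to ride along inside cond_expr.
     */
    smp_cond_load_acquire_timewait(&lock->locked,
                                   !VAL || check_deadlock(lock, &ts),
                                   0, 1);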
With that foundation, the new interface is:
smp_cond_load_relaxed_timewait(ptr, cond_expr, wait_policy,
time_expr, time_end)
The added parameter, wait_policy, provides a mechanism for the caller
to apportion time spent spinning or, where supported, in a wait.
This is somewhat inspired by the queue_poll() mechanism used
with smp_cond_load() in arm-smmu-v3 [4].
It addresses (1) by deciding the time-check granularity based on a
time interval instead of spinning for a fixed number of iterations.
(2) is addressed by the wait_policy allowing for different slack
values. The implemented wait_policy handlers allow for a coarse
or a fine-grained slack. A user-defined wait_policy could choose
its own wait parameter, which would also address (3).
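To make the shape of a call concrete, a caller might look something like
the sketch below (my_wait_policy and its exact signature are hypothetical;
the series itself provides coarse and fine-grained policy handlers):

    /*
     * Hypothetical sketch: wait until *ptr becomes non-zero or until
     * local_clock() reaches time_end, with my_wait_policy apportioning
     * the remaining time between spinning and waiting.
     */
    val = smp_cond_load_relaxed_timewait(ptr, VAL != 0,
                                         my_wait_policy,
                                         local_clock(), time_end);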
With that, patches 1-5 add the generic and arm64 logic:
"asm-generic: barrier: add smp_cond_load_relaxed_timewait()",
"asm-generic: barrier: add wait_policy handlers"
"arm64: barrier: enable waiting in smp_cond_load_relaxed_timewait()"
"arm64: barrier: add coarse wait for smp_cond_load_relaxed_timewait()"
"arm64: barrier: add fine wait for smp_cond_load_relaxed_timewait()".
Patch 6 adds the acquire variant:
"asm-generic: barrier: add smp_cond_load_acquire_timewait()"
Finally, patch 7 lays out how this could be used for rqspinlock:
"bpf: rqspinlock: add rqspinlock policy handler for arm64".
Any comments appreciated!
Ankur
[1] https://lore.kernel.org/lkml/20241107190818.522639-3-ankur.a.arora@oracle.com/
[2] Uses the smp_cond_load_acquire_timewait() from v1
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/include/asm/rqspinlock.h
[3] https://lore.kernel.org/lkml/Z8dRalfxYcJIcLGj@arm.com/
[4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c#n223
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: linux-arch@vger.kernel.org
Ankur Arora (7):
asm-generic: barrier: add smp_cond_load_relaxed_timewait()
asm-generic: barrier: add wait_policy handlers
arm64: barrier: enable waiting in smp_cond_load_relaxed_timewait()
arm64: barrier: add coarse wait for smp_cond_load_relaxed_timewait()
arm64: barrier: add fine wait for smp_cond_load_relaxed_timewait()
asm-generic: barrier: add smp_cond_load_acquire_timewait()
bpf: rqspinlock: add rqspinlock policy handler for arm64
arch/arm64/include/asm/barrier.h | 82 +++++++++++++++
arch/arm64/include/asm/rqspinlock.h | 96 ++++--------------
include/asm-generic/barrier.h | 150 ++++++++++++++++++++++++++++
3 files changed, 251 insertions(+), 77 deletions(-)
--
2.43.5
On Fri, 2025-05-02 at 01:52 -0700, Ankur Arora wrote:
> [...]

Tested on AWS Graviton (ARM64 Neoverse V1) with your V10 haltpoll
changes, atop master 83a896549f.

Reviewed-by: Haris Okanovic <harisokn@amazon.com>
Tested-by: Haris Okanovic <harisokn@amazon.com>

Regards,
Haris Okanovic
AWS Graviton Software
Okanovic, Haris <harisokn@amazon.com> writes:

> On Fri, 2025-05-02 at 01:52 -0700, Ankur Arora wrote:
>> [...]
>
> Tested on AWS Graviton (ARM64 Neoverse V1) with your V10 haltpoll
> changes, atop master 83a896549f.
>
> Reviewed-by: Haris Okanovic <harisokn@amazon.com>
> Tested-by: Haris Okanovic <harisokn@amazon.com>

Thanks for the review (and the testing)!

--
ankur
On Fri, 2 May 2025, Ankur Arora wrote:

> smp_cond_load_relaxed_spinwait(ptr, cond_expr, time_expr_ns, time_limit_ns)
> took four arguments, with ptr and cond_expr doing the usual smp_cond_load()
> things and time_expr_ns and time_limit_ns being used to decide the
> terminating condition.
>
> There were some problems in the timekeeping:
>
> 1. How often do we do the (relatively expensive) time-check?

Is this really important? We have instructions that wait on an event and
terminate at cycle counter values, like WFET on arm64.

The case where we need to perform time checks only arises if the
processor does not support WFET but must use an event stream, or does not
even have that available.

So the best approach is to have a simple interface where we specify the
cycle count when the wait is to be terminated and where we can cover that
with one WFET instruction.

The other cases then are degenerate forms of that. If only WFE is
available then only use that if the timeout is larger than the event
stream granularity. Or if both are not available then do the relax /
loop thing.

So the interface could be much simpler:

__smp_cond_load_relaxed_wait(ptr, timeout_cycle_count)

with a wrapper

smp_cond_relaxed_wait_expr(ptr, expr, timeout_cycle_count)

where we check the expression too and retry if the expression is not true.

The fallbacks with the spins and relax logic would be architecture
specific.
Christoph Lameter (Ampere) <cl@gentwo.org> writes:
> On Fri, 2 May 2025, Ankur Arora wrote:
>
>> smp_cond_load_relaxed_spinwait(ptr, cond_expr, time_expr_ns, time_limit_ns)
>> took four arguments, with ptr and cond_expr doing the usual smp_cond_load()
>> things and time_expr_ns and time_limit_ns being used to decide the
>> terminating condition.
>>
>> There were some problems in the timekeeping:
>>
>> 1. How often do we do the (relatively expensive) time-check?
>
> Is this really important? We have instructions that wait on an event and
> terminate at cycle counter values, like WFET on arm64.
>
> The case where we need to perform time checks only arises if the
> processor does not support WFET but must use an event stream, or does
> not even have that available.
AFAICT the vast majority of arm64 processors in the wild don't yet
support WFET. For instance I haven't been able to find a single one
to test my WFET changes with ;).
The other part is that this needs to be in common code and x86 primarily
uses PAUSE.
So, supporting both configurations: WFE + spin on arm64 and PAUSE on x86
needs a way of rate-limiting the time-check. Otherwise you run into
issues like this one:
commit 4dc2375c1a4e88ed2701f6961e0e4f9a7696ad3c
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date: Tue Mar 27 23:58:45 2018 +0200
cpuidle: poll_state: Avoid invoking local_clock() too often
Rik reports that he sees an increase in CPU use in one benchmark
due to commit 612f1a22f067 "cpuidle: poll_state: Add time limit to
poll_idle()" that caused poll_idle() to call local_clock() in every
iteration of the loop. Utilization increase generally means more
non-idle time with respect to total CPU time (on the average) which
implies reduced CPU frequency.
Doug reports that limiting the rate of local_clock() invocations
in there causes much less power to be drawn during a CPU-intensive
parallel workload (with idle states 1 and 2 disabled to enforce more
state 0 residency).
These two reports together suggest that executing local_clock() on
multiple CPUs in parallel at a high rate may cause chips to get hot
and trigger thermal/power limits on them to kick in, so reduce the
rate of local_clock() invocations in poll_idle() to avoid that issue.
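The rate-limited check that commit introduced looks roughly like this
(simplified from drivers/cpuidle/poll_state.c):

    while (!need_resched()) {
            cpu_relax();
            if (loop_count++ < POLL_IDLE_RELAX_COUNT)
                    continue;

            loop_count = 0;
            /* consult the clock only once every POLL_IDLE_RELAX_COUNT spins */
            if (local_clock() - time_start > limit) {
                    dev->poll_time_limit = true;
                    break;
            }
    }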
> So the best approach is to have a simple interface where we specify the
> cycle count when the wait is to be terminated and where we can cover that
> with one WFET instruction.
>
> The other cases then are degenerate forms of that. If only WFE is
> available then only use that if the timeout is larger than the event
> stream granularity. Or if both are not available then do the relax /
> loop thing.
That's what I had proposed for v1. But as Catalin pointed out, that's
not very useful when the caller wants to limit the overshoot.
This version tries to optimistically use WFE where possible while
minimizing the spin time.
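Roughly, the intent is something along these lines (a sketch of the idea,
not the actual implementation; evt_stream_period_ns and read_time() are
placeholders):

    while (!cond && now < end) {
            if (end - now > evt_stream_period_ns)
                    wfe();          /* coarse: woken by the event stream, ~100us */
            else
                    cpu_relax();    /* fine: spin out the tail to bound overshoot */
            now = read_time();
    }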
> So the interface could be much simpler:
>
> __smp_cond_load_relaxed_wait(ptr, timeout_cycle_count)
>
> with a wrapper
>
> smp_cond_relaxed_wait_expr(ptr, expr, timeout_cycle_count)
Oh, I would have absolutely liked to keep the interface simple, but
couldn't see a way to do that while managing the other constraints.
For instance, different users want different clocks: poll_idle() can make
do with an imprecise clock but rqspinlock needs ktime_get_mono().
I also think it's better to use standard clock types rather than
arm64-specific cycle counts or TSC values.
> where we check the expression too and retry if the expression is not true.
>
> The fallbacks with the spins and relax logic would be architecture
> specific.
Even if they were all architecture-specific, I suspect there's quite a
lot of variation in cpu_relax() behaviour across microarchitectures. For
instance, YIELD is a NOP on non-SMT arm64, but probably heavier on
SMT arm64.
Thanks for the quick review!
Ankur
On Fri, 2 May 2025, Ankur Arora wrote:

> AFAICT the vast majority of arm64 processors in the wild don't yet
> support WFET. For instance I haven't been able to find a single one
> to test my WFET changes with ;).

Ok then for patches 1-6:

Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org>
Christoph Lameter (Ampere) <cl@gentwo.org> writes:

> On Fri, 2 May 2025, Ankur Arora wrote:
>
>> AFAICT the vast majority of arm64 processors in the wild don't yet
>> support WFET. For instance I haven't been able to find a single one
>> to test my WFET changes with ;).
>
> Ok then for patches 1-6:
>
> Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org>

Great. Thanks Christoph!

--
ankur