Hi,
This series adds waited variants of the smp_cond_load() primitives:
smp_cond_load_relaxed_timeout(), and smp_cond_load_acquire_timeout().
With this version, the main remaining things are:
- Review by PeterZ of the new interface tif_need_resched_relaxed_wait()
(patch 11, "sched: add need-resched timed wait interface").
- Review of the BPF changes. This version simplifies the rqspinlock
changes by reusing the original error handling path
(patches 9, 10 "bpf/rqspinlock: switch check_timeout() to a clock
interface", "bpf/rqspinlock: Use smp_cond_load_acquire_timeout()").
- Review of WFET handling. (patch 4, "arm64: support WFET in
smp_cond_load_relaxed_timeout()").
The new interfaces are meant for contexts where you want to wait on a
condition variable for a finite duration. This is easy enough to do with
a loop around cpu_relax(). There are, however, architectures (ex. arm64)
that allow waiting on a cacheline instead.
So, these interfaces handle a mixture of spin/wait with a
smp_cond_load() thrown in. The interfaces are:
smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr, timeout)
smp_cond_load_acquire_timeout(ptr, cond_expr, time_expr, timeout)
The parameters time_expr and timeout determine when to bail out.
Also add tif_need_resched_relaxed_wait() which wraps the pattern used
in poll_idle() and abstracts out details of the interface and those
of the scheduler.
In addition add atomic_cond_read_*_timeout(), atomic64_cond_read_*_timeout(),
and atomic_long wrappers to the interfaces.
Finally update poll_idle() and resilient queued spinlocks to use them.
Changelog:
v9 [9]:
- s/@cond/@cond_expr/ (Randy Dunlap)
- Clarify that SMP_TIMEOUT_POLL_COUNT is only around memory
addresses. (David Laight)
- Add the missing config ARCH_HAS_CPU_RELAX in arch/arm64/Kconfig.
(Catalin Marinas).
- Switch to arch_counter_get_cntvct_stable() (via __delay_cycles())
in the cmpwait path instead of using arch_timer_read_counter().
(Catalin Marinas)
v8 [0]:
- Defer evaluation of @time_expr_ns to when we hit the slowpath.
(comment from Alexei Starovoitov).
- Mention that cpu_poll_relax() is better than raw CPU polling
only where ARCH_HAS_CPU_RELAX is defined.
- also define ARCH_HAS_CPU_RELAX for arm64.
(Came out of a discussion with Will Deacon.)
- Split out WFET and WFE handling. I was doing both of these
in a common handler.
(From Will Deacon and in an earlier revision by Catalin Marinas.)
- Add mentions of atomic_cond_read_{relaxed,acquire}(),
atomic_cond_read_{relaxed,acquire}_timeout() in
Documentation/atomic_t.txt.
- Use the BIT() macro to do the checking in tif_bitset_relaxed_wait().
- Cleanup unnecessary assignments, casts etc in poll_idle().
(From Rafael Wysocki.)
- Fixup warnings from kernel build robot
v7 [1]:
- change the interface to separately provide the timeout. This is
useful for supporting WFET and similar primitives which can do
timed waiting (suggested by Arnd Bergmann).
- Adapting rqspinlock code to this changed interface also
necessitated allowing time_expr to fail.
- rqspinlock changes to adapt to the new smp_cond_load_acquire_timeout().
- add WFET support (suggested by Arnd Bergmann).
- add support for atomic-long wrappers.
- add a new scheduler interface tif_need_resched_relaxed_wait() which
encapsulates the polling logic used by poll_idle().
- interface suggested by Rafael J. Wysocki.
v6 [2]:
- fixup missing timeout parameters in atomic64_cond_read_*_timeout()
- remove a race between setting of TIF_NEED_RESCHED and the call to
smp_cond_load_relaxed_timeout(). This would mean that dev->poll_time_limit
would be set even if we hadn't spent any time waiting.
(The original check compared against local_clock(), which would have been
fine, but I was instead using a cheaper check against _TIF_NEED_RESCHED.)
(Both from meta-CI bot)
v5 [3]:
- use cpu_poll_relax() instead of cpu_relax().
- instead of defining an arm64 specific
smp_cond_load_relaxed_timeout(), just define the appropriate
cpu_poll_relax().
- re-read the target pointer when we exit due to the time-check.
- s/SMP_TIMEOUT_SPIN_COUNT/SMP_TIMEOUT_POLL_COUNT/
(Suggested by Will Deacon)
- add atomic_cond_read_*_timeout() and atomic64_cond_read_*_timeout()
interfaces.
- rqspinlock: use atomic_cond_read_acquire_timeout().
- cpuidle: use smp_cond_load_relaxed_timeout() for polling.
(Suggested by Catalin Marinas)
- rqspinlock: define SMP_TIMEOUT_POLL_COUNT to be 16k for non arm64
v4 [4]:
- naming change 's/timewait/timeout/'
- resilient spinlocks: get rid of res_smp_cond_load_acquire_waiting()
and fixup use of RES_CHECK_TIMEOUT().
(Both suggested by Catalin Marinas)
v3 [5]:
- further interface simplifications (suggested by Catalin Marinas)
v2 [6]:
- simplified the interface (suggested by Catalin Marinas)
- get rid of wait_policy, and a multitude of constants
- adds a slack parameter
This helped remove a fair amount of code duplication and, in
hindsight, unnecessary constants.
v1 [7]:
- add wait_policy (coarse and fine)
- derive spin-count etc at runtime instead of using arbitrary
constants.
Haris Okanovic tested v4 of this series with poll_idle()/haltpoll patches. [8]
Comments appreciated!
Thanks
Ankur
[0] https://lore.kernel.org/lkml/20251215044919.460086-1-ankur.a.arora@oracle.com/
[1] https://lore.kernel.org/lkml/20251028053136.692462-1-ankur.a.arora@oracle.com/
[2] https://lore.kernel.org/lkml/20250911034655.3916002-1-ankur.a.arora@oracle.com/
[3] https://lore.kernel.org/lkml/20250911034655.3916002-1-ankur.a.arora@oracle.com/
[4] https://lore.kernel.org/lkml/20250829080735.3598416-1-ankur.a.arora@oracle.com/
[5] https://lore.kernel.org/lkml/20250627044805.945491-1-ankur.a.arora@oracle.com/
[6] https://lore.kernel.org/lkml/20250502085223.1316925-1-ankur.a.arora@oracle.com/
[7] https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/
[8] https://lore.kernel.org/lkml/2cecbf7fb23ee83a4ce027e1be3f46f97efd585c.camel@amazon.com/
[9] https://lore.kernel.org/lkml/20260209023153.2661784-1-ankur.a.arora@oracle.com/
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: bpf@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-pm@vger.kernel.org
Ankur Arora (12):
asm-generic: barrier: Add smp_cond_load_relaxed_timeout()
arm64: barrier: Support smp_cond_load_relaxed_timeout()
arm64/delay: move some constants out to a separate header
arm64: support WFET in smp_cond_load_relaxed_timeout()
arm64: rqspinlock: Remove private copy of
smp_cond_load_acquire_timewait()
asm-generic: barrier: Add smp_cond_load_acquire_timeout()
atomic: Add atomic_cond_read_*_timeout()
locking/atomic: scripts: build atomic_long_cond_read_*_timeout()
bpf/rqspinlock: switch check_timeout() to a clock interface
bpf/rqspinlock: Use smp_cond_load_acquire_timeout()
sched: add need-resched timed wait interface
cpuidle/poll_state: Wait for need-resched via
tif_need_resched_relaxed_wait()
Documentation/atomic_t.txt | 14 +++--
arch/arm64/Kconfig | 3 +
arch/arm64/include/asm/barrier.h | 23 +++++++
arch/arm64/include/asm/cmpxchg.h | 62 +++++++++++++++----
arch/arm64/include/asm/delay-const.h | 27 +++++++++
arch/arm64/include/asm/rqspinlock.h | 85 --------------------------
arch/arm64/lib/delay.c | 15 ++---
drivers/cpuidle/poll_state.c | 21 +------
drivers/soc/qcom/rpmh-rsc.c | 8 +--
include/asm-generic/barrier.h | 90 ++++++++++++++++++++++++++++
include/linux/atomic.h | 10 ++++
include/linux/atomic/atomic-long.h | 18 +++---
include/linux/sched/idle.h | 29 +++++++++
kernel/bpf/rqspinlock.c | 77 +++++++++++++++---------
scripts/atomic/gen-atomic-long.sh | 16 +++--
15 files changed, 320 insertions(+), 178 deletions(-)
create mode 100644 arch/arm64/include/asm/delay-const.h
--
2.31.1
On Sun, 15 Mar 2026 18:36:39 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:

> Hi,
>
> This series adds waited variants of the smp_cond_load() primitives:
> smp_cond_load_relaxed_timeout(), and smp_cond_load_acquire_timeout().
>
> ...

How are we to determine that this change is successful, useful, etc?
Reduced CPU consumption? Reduced energy usage? Improved latencies?

> Finally update poll_idle() and resilient queued spinlocks to use them.

Have you identified other suitable sites for conversion?

> Documentation/atomic_t.txt           | 14 +++--
> arch/arm64/Kconfig                   |  3 +
> arch/arm64/include/asm/barrier.h     | 23 +++++++
> arch/arm64/include/asm/cmpxchg.h     | 62 +++++++++++++++----
> arch/arm64/include/asm/delay-const.h | 27 +++++++++
> arch/arm64/include/asm/rqspinlock.h  | 85 --------------------------
> arch/arm64/lib/delay.c               | 15 ++---
> drivers/cpuidle/poll_state.c         | 21 +------
> drivers/soc/qcom/rpmh-rsc.c          |  8 +--
> include/asm-generic/barrier.h        | 90 ++++++++++++++++++++++++++++
> include/linux/atomic.h               | 10 ++++
> include/linux/atomic/atomic-long.h   | 18 +++---
> include/linux/sched/idle.h           | 29 +++++++++
> kernel/bpf/rqspinlock.c              | 77 +++++++++++++++---------
> scripts/atomic/gen-atomic-long.sh    | 16 +++--
> 15 files changed, 320 insertions(+), 178 deletions(-)
> create mode 100644 arch/arm64/include/asm/delay-const.h

Some sort of testing in lib/tests/ would be appropriate and useful.
Andrew Morton <akpm@linux-foundation.org> writes:
> On Sun, 15 Mar 2026 18:36:39 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Hi,
>>
>> This series adds waited variants of the smp_cond_load() primitives:
>> smp_cond_load_relaxed_timeout(), and smp_cond_load_acquire_timeout().
>>
>> ...
>>
>
> How are we to determine that this change is successful, useful, etc?
Good point. So this series was split off from this one here:
https://lore.kernel.org/lkml/20250218213337.377987-1-ankur.a.arora@oracle.com/
The series enables ARCH_HAS_CPU_RELAX on arm64 which should allow
relatively cheap polling in idle on arm64.
However, it does need a few more patches from the series above to do that.
> Reduced CPU consumption? Reduced energy usage? Improved latencies?
With the additional patches this should improve wakeup latency:
I ran the sched-pipe test with processes on VCPUs 4 and 5 with
kvm-arm.wfi_trap_policy=notrap.
# perf stat -r 5 --cpu 4,5 -e task-clock,cycles,instructions,sched:sched_wake_idle_without_ipi \
perf bench sched pipe -l 1000000 -c 4
# No haltpoll (and, no TIF_POLLING_NRFLAG):
Performance counter stats for 'CPU(s) 4,5' (5 runs):
25,229.57 msec task-clock # 2.000 CPUs utilized ( +- 7.75% )
45,821,250,284 cycles # 1.816 GHz ( +- 10.07% )
26,557,496,665 instructions # 0.58 insn per cycle ( +- 0.21% )
0 sched:sched_wake_idle_without_ipi # 0.000 /sec
12.615 +- 0.977 seconds time elapsed ( +- 7.75% )
# Haltpoll:
Performance counter stats for 'CPU(s) 4,5' (5 runs):
15,131.58 msec task-clock # 2.000 CPUs utilized ( +- 10.00% )
34,158,188,839 cycles # 2.257 GHz ( +- 6.91% )
20,824,950,916 instructions # 0.61 insn per cycle ( +- 0.09% )
1,983,822 sched:sched_wake_idle_without_ipi # 131.105 K/sec ( +- 0.78% )
7.566 +- 0.756 seconds time elapsed ( +- 10.00% )
We get a decent boost just because we are executing ~20% fewer
instructions. Not sure how the cpu frequency scaling works in a VM but
we also run at a higher frequency.
(That specifically applies to guests, but that series also enables this
with acpi-idle on baremetal.)
(From: https://lore.kernel.org/lkml/877c9zhk68.fsf@oracle.com/)
>> Finally update poll_idle() and resilient queued spinlocks to use them.
>
> Have you identified other suitable sites for conversion?
Haven't found other places in the core kernel where this could be used.
I think one reason is that the typical kernel wait is unbounded.
There are some in drivers/ that have this pattern. For instance I think
this in drivers/iommu/arm/arm-smmu-v3 could be converted:
__arm_smmu_cmdq_poll_until_msi().
However, as David Laight pointed out in this thread
(https://lore.kernel.org/lkml/20260214113122.70627a8b@pumpkin/)
that this would be fine so long as the polling is on memory, but would
need some work to handle MMIO.
>> Documentation/atomic_t.txt | 14 +++--
>> arch/arm64/Kconfig | 3 +
>> arch/arm64/include/asm/barrier.h | 23 +++++++
>> arch/arm64/include/asm/cmpxchg.h | 62 +++++++++++++++----
>> arch/arm64/include/asm/delay-const.h | 27 +++++++++
>> arch/arm64/include/asm/rqspinlock.h | 85 --------------------------
>> arch/arm64/lib/delay.c | 15 ++---
>> drivers/cpuidle/poll_state.c | 21 +------
>> drivers/soc/qcom/rpmh-rsc.c | 8 +--
>> include/asm-generic/barrier.h | 90 ++++++++++++++++++++++++++++
>> include/linux/atomic.h | 10 ++++
>> include/linux/atomic/atomic-long.h | 18 +++---
>> include/linux/sched/idle.h | 29 +++++++++
>> kernel/bpf/rqspinlock.c | 77 +++++++++++++++---------
>> scripts/atomic/gen-atomic-long.sh | 16 +++--
>> 15 files changed, 320 insertions(+), 178 deletions(-)
>> create mode 100644 arch/arm64/include/asm/delay-const.h
>
> Some sort of testing in lib/tests/ would be appropriate and useful.
Makes sense. Will add.
Thanks
--
ankur
On Mon, 16 Mar 2026 15:08:07 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:

...

> However, as David Laight pointed out in this thread
> (https://lore.kernel.org/lkml/20260214113122.70627a8b@pumpkin/)
> that this would be fine so long as the polling is on memory, but would
> need some work to handle MMIO.

I'm not sure the current code works with MMIO on arm64.

I was looking at the osq_lock() code, it uses smp_cond_load() with 'expr'
being 'VAL || need_resched()' expecting to get woken by the IPI associated
with the preemption being requested.
But the arm64 code relies on 'wfe' being woken when the memory write
'breaks' the 'ldx' for the monitored location.
That will only work for cached addresses.

For osq_lock(), while an IPI will wake it up, there is also a small timing
window where the IPI can happen before the ldx and so not actually wake up it.
This is true whenever 'expr' is non-trivial.

On arm64 I think you could use explicit sev and wfe - but that will wake all
'sleeping' cpu; and you may not want the 'thundering herd'.

	David
On Mon, Mar 16, 2026 at 11:37:12PM +0000, David Laight wrote:
> On Mon, 16 Mar 2026 15:08:07 -0700
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
> > However, as David Laight pointed out in this thread
> > (https://lore.kernel.org/lkml/20260214113122.70627a8b@pumpkin/)
> > that this would be fine so long as the polling is on memory, but would
> > need some work to handle MMIO.
>
> I'm not sure the current code works with MMIO on arm64.
It won't but also passing an MMIO pointer to smp_cond_load() is wrong in
general. You'd need a new API that takes an __iomem pointer.
> I was looking at the osq_lock() code, it uses smp_cond_load() with 'expr'
> being 'VAL || need_resched()' expecting to get woken by the IPI associated
> with the preemption being requested.
> But the arm64 code relies on 'wfe' being woken when the memory write
> 'breaks' the 'ldx' for the monitored location.
> That will only work for cached addresses.
Even worse, depending on the hardware, you may even get a data abort
when attempting LDXR on Device memory.
> For osq_lock(), while an IPI will wake it up, there is also a small timing
> window where the IPI can happen before the ldx and so not actually wake up it.
> This is true whenever 'expr' is non-trivial.
Hmm, I thought this is fine because of the implicit SEVL on exception
return but the arm64 __cmpwait_relaxed() does a SEVL+WFE which clears
any prior event, it can wait in theory forever when the event stream is
disabled.
Expanding smp_cond_load_relaxed() into asm, we have something like:
LDR X0, [PTR]
condition check for VAL || need_resched() with branch out
SEVL
WFE
LDXR X1, [PTR]
EOR X1, X1, X0
CBNZ out
WFE
out:
If the condition is updated to become true (need_resched()) after the
condition check but before the first WFE while *PTR remains unchanged,
the IPI won't do anything. Maybe we should revert 1cfc63b5ae60 ("arm64:
cmpwait: Clear event register before arming exclusive monitor"). Not
great but probably better than reverting f5bfdc8e3947 ("locking/osq: Use
optimized spinning loop for arm64")).
Using SEV instead of IPI would have the same problem.
--
Catalin
On Wed, 25 Mar 2026 15:55:21 +0000
Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Mon, Mar 16, 2026 at 11:37:12PM +0000, David Laight wrote:
...
> > For osq_lock(), while an IPI will wake it up, there is also a small timing
> > window where the IPI can happen before the ldx and so not actually wake up it.
> > This is true whenever 'expr' is non-trivial.
>
> Hmm, I thought this is fine because of the implicit SEVL on exception
> return but the arm64 __cmpwait_relaxed() does a SEVL+WFE which clears
> any prior event, it can wait in theory forever when the event stream is
> disabled.
Not forever, there will be a timer interrupt in the end.
> Expanding smp_cond_load_relaxed() into asm, we have something like:
>
> LDR X0, [PTR]
> condition check for VAL || need_resched() with branch out
> SEVL
> WFE
> LDXR X1, [PTR]
> EOR X1, X1, X0
> CBNZ out
> WFE
> out:
>
> If the condition is updated to become true (need_resched()) after the
> condition check but before the first WFE while *PTR remains unchanged,
> the IPI won't do anything. Maybe we should revert 1cfc63b5ae60 ("arm64:
> cmpwait: Clear event register before arming exclusive monitor"). Not
> great but probably better than reverting f5bfdc8e3947 ("locking/osq: Use
> optimized spinning loop for arm64")).
Could you change the order to:
LDR X0, [PTR]
SEVL
WFE
condition check for VAL || need_resched() with branch out
LDXR X1, [PTR]
EOR X1, X1, X0
CBNZ out
WFE
out:
that closes the timing window for the interrupt provided the condition
check doesn't change the event register.
I must get back to the osq_lock code again.
I'm happy with the code - the per-cpu data is down to two cpu numbers.
(Apart from the acquire/release semantics in a few places.)
But the comments have got out of hand.
Writing succinct and accurate comments is hard - too verbose and they
hide too much code.
David
David Laight <david.laight.linux@gmail.com> writes:
> On Mon, 16 Mar 2026 15:08:07 -0700
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> ...
>> However, as David Laight pointed out in this thread
>> (https://lore.kernel.org/lkml/20260214113122.70627a8b@pumpkin/)
>> that this would be fine so long as the polling is on memory, but would
>> need some work to handle MMIO.
>
> I'm not sure the current code works with MMIO on arm64.
>
> I was looking at the osq_lock() code, it uses smp_cond_load() with 'expr'
> being 'VAL || need_resched()' expecting to get woken by the IPI associated
> with the preemption being requested.
Yeah, osq_lock() has:
if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
vcpu_is_preempted(node_cpu(node->prev))))
This works on x86 even with a non-stub vcpu_is_preempted() because we
poll on the node->locked while evaluating the conditional.
Won't work on arm64 since there we are just waiting on node->locked.
(And so we depend on the IPI there.)
> But the arm64 code relies on 'wfe' being woken when the memory write
> 'breaks' the 'ldx' for the monitored location.
Agreed.
> That will only work for cached addresses.
Yeah I was forgetting that. LDXR won't really work with non cacheable
memory. So a generic MMIO version might not be possible at all.
> For osq_lock(), while an IPI will wake it up, there is also a small timing
> window where the IPI can happen before the ldx and so not actually wake up it.
> This is true whenever 'expr' is non-trivial.
Yeah because we are checking the state of more than one address in the
conditional but the waiting monitor only does one memory location at a
time. If only CPUs had vectored waiting mechanisms...
I think the smp_cond_load_*() in poll_idle() and in rqspinlock monitor
a single address (and in the conditional) so those two should be okay.
> On arm64 I think you could use explicit sev and wfe - but that will wake all
> 'sleeping' cpu; and you may not want the 'thundering herd'.
Wouldn't we still have the same narrow window where the CPU disregards the IPI?
--
ankur
On Mon, 16 Mar 2026 23:53:22 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:

> David Laight <david.laight.linux@gmail.com> writes:
...
> > On arm64 I think you could use explicit sev and wfe - but that will wake all
> > 'sleeping' cpu; and you may not want the 'thundering herd'.
>
> Wouldn't we still have the same narrow window where the CPU disregards the IPI?

You need a 'sevl' in the interrupt exit path.
(Or, more specifically, the ISR needs to exit with the flag set.)
Actually I think you need one anyway, if the ISR clears the flag then the
existing code will sleep until the following IRQ if an interrupt happens
between the ldx and wfe.
I've not looked at the ISR exit code.

Ignoring the vcpu check (which is fairly broken anyway), I think osq_lock()
would be ok if the 'osq node' were in the right cache line.
I've some patches pending (I need to sort out lots of comments) that reduce
the osq_node down to two cpu numbers; 8 bytes but possibly only 4 although
that is harder without 16bit atomics.
That would work for arm32 (ldx uses a cache-line resolution) but I'm not
sure about similar functionality on other cpu.

	David

> --
> ankur
On Tue, Mar 17, 2026 at 09:17:05AM +0000, David Laight wrote:
> On Mon, 16 Mar 2026 23:53:22 -0700
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
> > David Laight <david.laight.linux@gmail.com> writes:
> > > On arm64 I think you could use explicit sev and wfe - but that will wake all
> > > 'sleeping' cpu; and you may not want the 'thundering herd'.
> >
> > Wouldn't we still have the same narrow window where the CPU disregards the IPI?
>
> You need a 'sevl' in the interrupt exit path.

No need to, see the rule below in
https://developer.arm.com/documentation/ddi0487/maa/2983-beijhbbd:

  R_XRZRK
  The Event Register for a PE is set by any of the following:
  [...]
  - An exception return.

--
Catalin
On Wed, 25 Mar 2026 13:53:50 +0000 Catalin Marinas <catalin.marinas@arm.com> wrote:

> On Tue, Mar 17, 2026 at 09:17:05AM +0000, David Laight wrote:
> > On Mon, 16 Mar 2026 23:53:22 -0700
> > Ankur Arora <ankur.a.arora@oracle.com> wrote:
> > > David Laight <david.laight.linux@gmail.com> writes:
> > > > On arm64 I think you could use explicit sev and wfe - but that will wake all
> > > > 'sleeping' cpu; and you may not want the 'thundering herd'.
> > >
> > > Wouldn't we still have the same narrow window where the CPU disregards the IPI?
> >
> > You need a 'sevl' in the interrupt exit path.
>
> No need to, see the rule below in
> https://developer.arm.com/documentation/ddi0487/maa/2983-beijhbbd:
>
>   R_XRZRK
>   The Event Register for a PE is set by any of the following:
>   [...]
>   - An exception return.

It is a shame the pages for the SEV and WFE instructions don't mention that.
And the copy I found doesn't have working hyperlinks to any other sections.
(Not even references to related instructions...)

You do need to at least comment that the "msr s0_3_c1_c0_0, %[ecycles]" is
actually WFET.
Is that using an absolute cycle count?
If so does it work if the time has already passed?
If it is absolute do you need to recalculate it every time around the loop?

__delay_cycles() contains guard(preempt_notrace()). I haven't looked what
that does but is it needed here since preemption is disabled?

Looking at the code I think the "sevl; wfe" pair should be higher up.
If they were before the evaluation of the condition then an IPI that set
need_resched() just after it was tested would cause a wakeup.
Clearly that won't help if the condition does anything that executes 'wfe'
and won't sleep if it sets the event - but I suspect they are unlikely.

I also wonder how long it takes the cpu to leave any low power state.
We definitely found that was an issue on some x86 cpu and had to both
disable the lowest low power state and completely rework some wakeup code
that really wanted a 'thundering herd' rather than the very gentle 'bring
each cpu out of low power one at a time' that cv_broadcast() gave it.

	David
On Wed, Mar 25, 2026 at 03:42:10PM +0000, David Laight wrote:
> On Wed, 25 Mar 2026 13:53:50 +0000
> Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> > On Tue, Mar 17, 2026 at 09:17:05AM +0000, David Laight wrote:
> > > On Mon, 16 Mar 2026 23:53:22 -0700
> > > Ankur Arora <ankur.a.arora@oracle.com> wrote:
> > > > David Laight <david.laight.linux@gmail.com> writes:
> > > > > On arm64 I think you could use explicit sev and wfe - but that will wake all
> > > > > 'sleeping' cpu; and you may not want the 'thundering herd'.
> > > >
> > > > Wouldn't we still have the same narrow window where the CPU disregards the IPI?
> > >
> > > You need a 'sevl' in the interrupt exit path.
> >
> > No need to, see the rule below in
> > https://developer.arm.com/documentation/ddi0487/maa/2983-beijhbbd:
> >
> > R_XRZRK
> > The Event Register for a PE is set by any of the following:
> > [...]
> > - An exception return.
> >
>
> It is a shame the pages for the SEV and WFE instructions don't mention that.
> And the copy I found doesn't have working hyperlinks to any other sections.
> (Not even references to related instructions...)
The latest architecture spec (M.a.a) has working hyperlinks.
> You do need to at least comment that the "msr s0_3_c1_c0_0, %[ecycles]" is
> actually WFET.
> Is that using an absolute cycle count?
Yes, compared to CNTVCT.
> If so does it work if the time has already passed?
Yes, it exits immediately. These counters are not going to wrap in our
(or device's) lifetime.
> If it is absolute do you need to recalculate it every time around the loop?
No but you do need to read CNTVCT, that's what __delay_cycles() does (it
does not wait).
> __delay_cycles() contains guard(preempt_notrace()). I haven't looked what
> that does but is it needed here since preemption is disabled?
The guard was added recently by commit e5cb94ba5f96 ("arm64: Fix
sampling the "stable" virtual counter in preemptible section"). It's
needed for the udelay() case but probably not for Ankur's series. Maybe
we can move the guard in the caller?
> Looking at the code I think the "sevl; wfe" pair should be higher up.
Yes, I replied to your other message. We could move it higher indeed,
before the condition check, but I can't get my head around the ordering.
Can need_resched() check be speculated before the WFE? I need to think
some more.
> I also wonder how long it takes the cpu to leave any low power state.
> We definitely found that was an issue on some x86 cpu and had to both
> disable the lowest low power state and completely rework some wakeup
> code that really wanted a 'thundering herd' rather than the very gentle
> 'bring each cpu out of low power one at a time' that cv_broadcast()
> gave it.
WFE is a very shallow power state where all hardware state is retained.
We have an event stream broadcast to all CPUs regularly already (10KHz)
and I haven't heard people complaining about power degradation. If a CPU
is a WFI state or even deeper into firmware (following a PSCI call), an
exclusive monitor event won't wake it up. It's only for those cores
waiting in WFE.
--
Catalin
On Wed, 25 Mar 2026 16:32:49 +0000
Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Wed, Mar 25, 2026 at 03:42:10PM +0000, David Laight wrote:
...
> > Looking at the code I think the "sevl; wfe" pair should be higher up.
>
> Yes, I replied to your other message. We could move it higher indeed,
> before the condition check, but I can't get my head around the ordering.
> Can need_resched() check be speculated before the WFE? I need to think
> some more.
I don't think speculation can matter.
Both SEVL and WFE must be serialised against any other instructions
that can change the event flag (as well as each other) otherwise
everything is broken.
Apart from that it doesn't matter, what matters is the instruction
boundary the interrupt happens at.
Actually both SEVL and WFE may be synchronising instructions and very slow.
So you may not want to put them in the fast path where the condition
is true on entry (or even true after a retry).
So the code might have to look like:
for (;;) {
VAL = mem;
if (cond(VAL)) return;
SEVL; WFE;
if (cond(VAL)) return;
v1 = LDX(mem);
if (v1 == VAL)
WFE;
}
Definitely needs a comment that both 'return from exception' and
losing the exclusive access set the event flag.
The asm probably ought to have a full "memory" clobber.
Just in case the compiler gets lively with re-ordering.
David
On Wed, Mar 25, 2026 at 08:23:57PM +0000, David Laight wrote:
> On Wed, 25 Mar 2026 16:32:49 +0000
> Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> > On Wed, Mar 25, 2026 at 03:42:10PM +0000, David Laight wrote:
> ...
> > > Looking at the code I think the "sevl; wfe" pair should be higher up.
> >
> > Yes, I replied to your other message. We could move it higher indeed,
> > before the condition check, but I can't get my head around the ordering.
> > Can need_resched() check be speculated before the WFE? I need to think
> > some more.
>
> I don't think speculation can matter.
> Both SEVL and WFE must be serialised against any other instructions
> that can change the event flag (as well as each other) otherwise
> everything is broken.
Welcome to the Arm memory model. We don't have any guarantee that an LDR
will only access memory after SEVL+WFE. They are not serialising.
> Apart from that it doesn't matter, what matters is the instruction
> boundary the interrupt happens at.
True. If an interrupt is taken before the LDR (that would be a
need_resched() check for example), then a prior WFE would not matter.
This won't work if we replace the IPI with a SEV though (suggested
somewhere in this thread).
> Actually both SEVL and WFE may be synchronising instructions and very slow.
Most likely not.
> So you may not want to put them in the fast path where the condition
> is true on entry (or even true after a retry).
> So the code might have to look like:
> for (;;) {
> VAL = mem;
If we only waited on the location passed to LDXR, things would have been
much simpler. But the osq_lock() also wants to wait on the TIF flags via
need_resched() (and vcpu_is_preempted()).
> if (cond(VAL)) return;
So the cond(VAL) here is actually a series of other memory loads
unrelated to 'mem'
> SEVL; WFE;
> if (cond(VAL)) return;
I think this will work in principle even if 'cond' accesses other memory
locations, though I wouldn't bother with an additional 'cond' call, I'd
expect SEVL+WFE to be mostly NOPs. However, 'cond' must not set a local
event, otherwise the power saving on waiting is gone.
> v1 = LDX(mem);
> if (v1 == VAL)
> WFE;
> }
I think it's cleaner to use Ankur's timeout API here for the very rare
case where an IPI hits at the wrong time. We then keep
smp_cond_load_relaxed() intact as it's really not meant to wait on
multiple memory locations to change. Any changes of
smp_cond_load_relaxed() with moving the WFE around are just hacks, not
the intended use of this API.
--
Catalin