Hi,
This series adds waited variants of the smp_cond_load() primitives:
smp_cond_load_relaxed_timewait(), and smp_cond_load_acquire_timewait().
Why? As the names suggest, the new interfaces are meant for contexts
where you want to wait on a condition for a finite duration.
This is easy enough to do with a loop around cpu_relax(). However,
some architectures (e.g. arm64) also allow waiting on a cacheline. So,
these interfaces handle a mixture of spin/wait with a smp_cond_load()
thrown in.
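To give a rough idea of the shape, a simplified, illustrative sketch
of a generic fallback (the actual implementation may amortize the time
check over several spins rather than evaluating it on every iteration):

	#define smp_cond_load_relaxed_timewait(ptr, cond_expr, time_check_expr) \
	({								\
		typeof(ptr) __PTR = (ptr);				\
		__unqual_scalar_typeof(*ptr) VAL;			\
		for (;;) {						\
			VAL = READ_ONCE(*__PTR);			\
			if (cond_expr)					\
				break;					\
			cpu_relax();					\
			if (time_check_expr)				\
				break;					\
		}							\
		(typeof(*ptr))VAL;					\
	})

On arm64, the cpu_relax() spin can instead be replaced by waiting on
the cacheline (LDXR/WFE), which is what the arm64 patches wire up.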
There are two known users for these interfaces:
- poll_idle() [1]
- resilient queued spinlocks [2]
The interfaces are:
smp_cond_load_relaxed_timewait(ptr, cond_expr, time_check_expr)
smp_cond_load_acquire_timewait(ptr, cond_expr, time_check_expr)
The added parameter, time_check_expr, determines the bail out condition.
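As an example, a usage sketch (lock_word, timeout_ns, and
handle_timeout() are hypothetical names; VAL refers to the most
recently loaded value inside cond_expr, as with the other
smp_cond_load() primitives):

	u64 deadline = local_clock_noinstr() + timeout_ns;
	u32 val;

	/* Wait for lock_word to become zero, or bail out at the deadline. */
	val = smp_cond_load_relaxed_timewait(&lock_word, VAL == 0,
					     local_clock_noinstr() >= deadline);
	if (val != 0)
		handle_timeout();	/* timed out before the condition held */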
Changelog:
v3 [3]:
- further interface simplifications (suggested by Catalin Marinas)
v2 [4]:
- simplified the interface (suggested by Catalin Marinas)
- get rid of wait_policy, and a multitude of constants
- adds a slack parameter
This helped remove a fair amount of code duplication and, in
hindsight, unnecessary constants.
v1 [5]:
- add wait_policy (coarse and fine)
- derive spin-count etc at runtime instead of using arbitrary
constants.
Haris Okanovic had tested an earlier version of this series with
poll_idle()/haltpoll patches. [6]
Any comments appreciated!
Thanks!
Ankur
[1] https://lore.kernel.org/lkml/20241107190818.522639-3-ankur.a.arora@oracle.com/
[2] Uses the smp_cond_load_acquire_timewait() from v1
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/include/asm/rqspinlock.h
[3] https://lore.kernel.org/lkml/20250627044805.945491-1-ankur.a.arora@oracle.com/
[4] https://lore.kernel.org/lkml/20250502085223.1316925-1-ankur.a.arora@oracle.com/
[5] https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/
[6] https://lore.kernel.org/lkml/f2f5d09e79539754ced085ed89865787fa668695.camel@amazon.com
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: linux-arch@vger.kernel.org
Ankur Arora (5):
asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
arm64: barrier: Add smp_cond_load_relaxed_timewait()
arm64: rqspinlock: Remove private copy of
smp_cond_load_acquire_timewait
asm-generic: barrier: Add smp_cond_load_acquire_timewait()
rqspinlock: use smp_cond_load_acquire_timewait()
arch/arm64/include/asm/barrier.h | 22 ++++++++
arch/arm64/include/asm/rqspinlock.h | 84 +----------------------------
include/asm-generic/barrier.h | 57 ++++++++++++++++++++
include/asm-generic/rqspinlock.h | 4 ++
kernel/bpf/rqspinlock.c | 25 ++++-----
5 files changed, 93 insertions(+), 99 deletions(-)
--
2.31.1
On Fri, Aug 29, 2025 at 01:07:30AM -0700, Ankur Arora wrote:
> Ankur Arora (5):
>   asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
>   arm64: barrier: Add smp_cond_load_relaxed_timewait()
>   arm64: rqspinlock: Remove private copy of
>     smp_cond_load_acquire_timewait
>   asm-generic: barrier: Add smp_cond_load_acquire_timewait()
>   rqspinlock: use smp_cond_load_acquire_timewait()

Can you have a go at poll_idle() to see how it would look using
this API? It doesn't necessarily mean we have to merge them all at once
but it gives us a better idea of the suitability of the interface.

--
Catalin
On Mon, 2025-09-01 at 12:49 +0100, Catalin Marinas wrote:
> On Fri, Aug 29, 2025 at 01:07:30AM -0700, Ankur Arora wrote:
> > Ankur Arora (5):
> >   asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
> >   arm64: barrier: Add smp_cond_load_relaxed_timewait()
> >   arm64: rqspinlock: Remove private copy of
> >     smp_cond_load_acquire_timewait
> >   asm-generic: barrier: Add smp_cond_load_acquire_timewait()
> >   rqspinlock: use smp_cond_load_acquire_timewait()
>
> Can you have a go at poll_idle() to see how it would look using
> this API? It doesn't necessarily mean we have to merge them all at once
> but it gives us a better idea of the suitability of the interface.

This is what I tested on ARM on Fri:
https://github.com/harisokanovic/linux/blob/37e02b950c99370466e7385e5e754bbd6232ef95/drivers/cpuidle/poll_state.c#L24

Regards,
Haris Okanovic
AWS Graviton Software
Catalin Marinas <catalin.marinas@arm.com> writes:
> On Fri, Aug 29, 2025 at 01:07:30AM -0700, Ankur Arora wrote:
>> Ankur Arora (5):
>> asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
>> arm64: barrier: Add smp_cond_load_relaxed_timewait()
>> arm64: rqspinlock: Remove private copy of
>> smp_cond_load_acquire_timewait
>> asm-generic: barrier: Add smp_cond_load_acquire_timewait()
>> rqspinlock: use smp_cond_load_acquire_timewait()
>
> Can you have a go at poll_idle() to see how it would look using
> this API? It doesn't necessarily mean we have to merge them all at once
> but it gives us a better idea of the suitability of the interface.
So, I've been testing with some version of the following:
diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
index 9b6d90a72601..361879396d0c 100644
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -8,35 +8,25 @@
#include <linux/sched/clock.h>
#include <linux/sched/idle.h>
-#define POLL_IDLE_RELAX_COUNT 200
-
static int __cpuidle poll_idle(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int index)
{
- u64 time_start;
-
- time_start = local_clock_noinstr();
+ unsigned long flags;
dev->poll_time_limit = false;
raw_local_irq_enable();
if (!current_set_polling_and_test()) {
- unsigned int loop_count = 0;
- u64 limit;
+ u64 limit, time_end;
limit = cpuidle_poll_time(drv, dev);
+ time_end = local_clock_noinstr() + limit;
- while (!need_resched()) {
- cpu_relax();
- if (loop_count++ < POLL_IDLE_RELAX_COUNT)
- continue;
+ flags = smp_cond_load_relaxed_timewait(&current_thread_info()->flags,
+ VAL & _TIF_NEED_RESCHED,
+ (local_clock_noinstr() >= time_end));
- loop_count = 0;
- if (local_clock_noinstr() - time_start > limit) {
- dev->poll_time_limit = true;
- break;
- }
- }
+ dev->poll_time_limit = (local_clock_noinstr() >= time_end);
}
raw_local_irq_disable();
With that, poll_idle() is:
static int __cpuidle poll_idle(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int index)
{
unsigned long flags;
dev->poll_time_limit = false;
raw_local_irq_enable();
if (!current_set_polling_and_test()) {
u64 limit, time_end;
limit = cpuidle_poll_time(drv, dev);
time_end = local_clock_noinstr() + limit;
flags = smp_cond_load_relaxed_timewait(&current_thread_info()->flags,
VAL & _TIF_NEED_RESCHED,
(local_clock_noinstr() >= time_end));
dev->poll_time_limit = (local_clock_noinstr() >= time_end);
}
raw_local_irq_disable();
current_clr_polling();
return index;
}
--
ankur
On Tue, Sep 02, 2025 at 03:46:52PM -0700, Ankur Arora wrote:
> Catalin Marinas <catalin.marinas@arm.com> writes:
> > Can you have a go at poll_idle() to see how it would look using
> > this API? It doesn't necessarily mean we have to merge them all at once
> > but it gives us a better idea of the suitability of the interface.
>
> So, I've been testing with some version of the following:
>
> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> index 9b6d90a72601..361879396d0c 100644
> --- a/drivers/cpuidle/poll_state.c
> +++ b/drivers/cpuidle/poll_state.c
> @@ -8,35 +8,25 @@
> #include <linux/sched/clock.h>
> #include <linux/sched/idle.h>
>
> -#define POLL_IDLE_RELAX_COUNT 200
> -
> static int __cpuidle poll_idle(struct cpuidle_device *dev,
> struct cpuidle_driver *drv, int index)
> {
> - u64 time_start;
> -
> - time_start = local_clock_noinstr();
> + unsigned long flags;
>
> dev->poll_time_limit = false;
>
> raw_local_irq_enable();
> if (!current_set_polling_and_test()) {
> - unsigned int loop_count = 0;
> - u64 limit;
> + u64 limit, time_end;
>
> limit = cpuidle_poll_time(drv, dev);
> + time_end = local_clock_noinstr() + limit;
>
> - while (!need_resched()) {
> - cpu_relax();
> - if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> - continue;
> + flags = smp_cond_load_relaxed_timewait(&current_thread_info()->flags,
> + VAL & _TIF_NEED_RESCHED,
> + (local_clock_noinstr() >= time_end));
It makes sense to have the non-strict comparison, though it changes the
original behaviour slightly. Just mention it in the commit log.
>
> - loop_count = 0;
> - if (local_clock_noinstr() - time_start > limit) {
> - dev->poll_time_limit = true;
> - break;
> - }
> - }
> + dev->poll_time_limit = (local_clock_noinstr() >= time_end);
Could we do this instead and avoid another clock read:
dev->poll_time_limit = !(flags & _TIF_NEED_RESCHED);
In the original code, it made sense since it had to check the clock
anyway and break the loop.
When you repost, please include the rqspinlock and poll_idle changes as
well to show how the interface is used.
--
Catalin
Catalin Marinas <catalin.marinas@arm.com> writes:
> On Tue, Sep 02, 2025 at 03:46:52PM -0700, Ankur Arora wrote:
>> Catalin Marinas <catalin.marinas@arm.com> writes:
>> > Can you have a go at poll_idle() to see how it would look using
>> > this API? It doesn't necessarily mean we have to merge them all at once
>> > but it gives us a better idea of the suitability of the interface.
>>
>> So, I've been testing with some version of the following:
>>
>> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
>> index 9b6d90a72601..361879396d0c 100644
>> --- a/drivers/cpuidle/poll_state.c
>> +++ b/drivers/cpuidle/poll_state.c
>> @@ -8,35 +8,25 @@
>> #include <linux/sched/clock.h>
>> #include <linux/sched/idle.h>
>>
>> -#define POLL_IDLE_RELAX_COUNT 200
>> -
>> static int __cpuidle poll_idle(struct cpuidle_device *dev,
>> struct cpuidle_driver *drv, int index)
>> {
>> - u64 time_start;
>> -
>> - time_start = local_clock_noinstr();
>> + unsigned long flags;
>>
>> dev->poll_time_limit = false;
>>
>> raw_local_irq_enable();
>> if (!current_set_polling_and_test()) {
>> - unsigned int loop_count = 0;
>> - u64 limit;
>> + u64 limit, time_end;
>>
>> limit = cpuidle_poll_time(drv, dev);
>> + time_end = local_clock_noinstr() + limit;
>>
>> - while (!need_resched()) {
>> - cpu_relax();
>> - if (loop_count++ < POLL_IDLE_RELAX_COUNT)
>> - continue;
>> + flags = smp_cond_load_relaxed_timewait(&current_thread_info()->flags,
>> + VAL & _TIF_NEED_RESCHED,
>> + (local_clock_noinstr() >= time_end));
>
> It makes sense to have the non-strict comparison, though it changes the
> original behaviour slightly. Just mention it in the commit log.
>
>>
>> - loop_count = 0;
>> - if (local_clock_noinstr() - time_start > limit) {
>> - dev->poll_time_limit = true;
>> - break;
>> - }
>> - }
>> + dev->poll_time_limit = (local_clock_noinstr() >= time_end);
>
> Could we do this instead and avoid another clock read:
>
> dev->poll_time_limit = !(flags & _TIF_NEED_RESCHED);
>
> In the original code, it made sense since it had to check the clock
> anyway and break the loop.
>
> When you repost, please include the rqspinlock and poll_idle changes as
> well to show how the interface is used.
Sure.
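Something like this for the tail, I think (a sketch folding in your
suggestion, so the only clock reads are the ones inside the wait):

	flags = smp_cond_load_relaxed_timewait(&current_thread_info()->flags,
					       VAL & _TIF_NEED_RESCHED,
					       (local_clock_noinstr() >= time_end));

	/* Timed out iff the wait never observed the resched flag. */
	dev->poll_time_limit = !(flags & _TIF_NEED_RESCHED);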
--
ankur
On Fri, 2025-08-29 at 01:07 -0700, Ankur Arora wrote:
> Hi,
>
> This series adds waited variants of the smp_cond_load() primitives:
> smp_cond_load_relaxed_timewait(), and smp_cond_load_acquire_timewait().
>
> [...]
>
> Ankur Arora (5):
>   asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
>   arm64: barrier: Add smp_cond_load_relaxed_timewait()
>   arm64: rqspinlock: Remove private copy of
>     smp_cond_load_acquire_timewait
>   asm-generic: barrier: Add smp_cond_load_acquire_timewait()
>   rqspinlock: use smp_cond_load_acquire_timewait()

Tested on AWS Graviton 2, 3, and 4 (ARM64 Neoverse N1, V1, and V2) with
your V10 haltpoll changes, atop 6.17.0-rc3 (commit 07d9df8008).

Still seeing between 1.3x and 2.5x speedups in `perf bench sched pipe`
and `seccomp-notify`; no change in `messaging`.
Reviewed-by: Haris Okanovic <harisokn@amazon.com>
Tested-by: Haris Okanovic <harisokn@amazon.com>

Regards,
Haris Okanovic
AWS Graviton Software
Okanovic, Haris <harisokn@amazon.com> writes:

> On Fri, 2025-08-29 at 01:07 -0700, Ankur Arora wrote:
>> This series adds waited variants of the smp_cond_load() primitives:
>> smp_cond_load_relaxed_timewait(), and smp_cond_load_acquire_timewait().
>>
>> [...]
>
> Tested on AWS Graviton 2, 3, and 4 (ARM64 Neoverse N1, V1, and V2) with
> your V10 haltpoll changes, atop 6.17.0-rc3 (commit 07d9df8008).
> Still seeing between 1.3x and 2.5x speedups in `perf bench sched pipe`
> and `seccomp-notify`; no change in `messaging`.

Great.

> Reviewed-by: Haris Okanovic <harisokn@amazon.com>
> Tested-by: Haris Okanovic <harisokn@amazon.com>

Thank you.

--
ankur