Hi,

This series adds waited variants of the smp_cond_load() primitives:
smp_cond_load_relaxed_timewait(), and smp_cond_load_acquire_timewait().

Why?: as the name suggests, the new interfaces are meant for contexts
where you want to wait on a condition variable for a finite duration.
This is easy enough to do with a loop around cpu_relax(). However,
some architectures (ex. arm64) also allow waiting on a cacheline. So,
these interfaces handle a mixture of spin/wait with a smp_cond_load()
thrown in.

There are two known users for these interfaces:

 - poll_idle() [1]
 - resilient queued spinlocks [2]

The interfaces are:

   smp_cond_load_relaxed_timewait(ptr, cond_expr, time_check_expr)
   smp_cond_load_acquire_timewait(ptr, cond_expr, time_check_expr)

The added parameter, time_check_expr, determines the bail out
condition.

Changelog:
v3 [3]:
 - further interface simplifications (suggested by Catalin Marinas)

v2 [4]:
 - simplified the interface (suggested by Catalin Marinas)
 - get rid of wait_policy, and a multitude of constants
 - adds a slack parameter
 This helped remove a fair amount of duplicated code and, in
 hindsight, unnecessary constants.

v1 [5]:
 - add wait_policy (coarse and fine)
 - derive spin-count etc at runtime instead of using arbitrary
   constants.

Haris Okanovic had tested an earlier version of this series with
poll_idle()/haltpoll patches. [6]

Any comments appreciated!

Thanks!
Ankur

[1] https://lore.kernel.org/lkml/20241107190818.522639-3-ankur.a.arora@oracle.com/
[2] Uses the smp_cond_load_acquire_timewait() from v1
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/include/asm/rqspinlock.h
[3] https://lore.kernel.org/lkml/20250627044805.945491-1-ankur.a.arora@oracle.com/
[4] https://lore.kernel.org/lkml/20250502085223.1316925-1-ankur.a.arora@oracle.com/
[5] https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/
[6] https://lore.kernel.org/lkml/f2f5d09e79539754ced085ed89865787fa668695.camel@amazon.com

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: linux-arch@vger.kernel.org

Ankur Arora (5):
  asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  arm64: barrier: Add smp_cond_load_relaxed_timewait()
  arm64: rqspinlock: Remove private copy of
    smp_cond_load_acquire_timewait
  asm-generic: barrier: Add smp_cond_load_acquire_timewait()
  rqspinlock: use smp_cond_load_acquire_timewait()

 arch/arm64/include/asm/barrier.h    | 22 ++++++++
 arch/arm64/include/asm/rqspinlock.h | 84 +----------------------------
 include/asm-generic/barrier.h       | 57 ++++++++++++++++++++
 include/asm-generic/rqspinlock.h    |  4 ++
 kernel/bpf/rqspinlock.c             | 25 ++++-----
 5 files changed, 93 insertions(+), 99 deletions(-)

--
2.31.1
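To make the shape of the interface concrete, here is a rough sketch of
what a generic, cpu_relax()-based fallback and an acquire variant
layered on top of it could look like. The spin-batch constant (200) and
the placement of the time check are illustrative assumptions, not the
implementation from this series:

/*
 * Illustrative sketch only -- not the implementation from the series.
 * A generic fallback spins with cpu_relax() and evaluates the
 * (possibly expensive) time_check_expr once per batch of iterations.
 */
#ifndef smp_cond_load_relaxed_timewait
#define smp_cond_load_relaxed_timewait(ptr, cond_expr, time_check_expr) \
({									\
	typeof(ptr) __PTR = (ptr);					\
	__unqual_scalar_typeof(*ptr) VAL;				\
	unsigned int __n = 0;						\
									\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		cpu_relax();						\
		if (++__n % 200)	/* assumed batch size */	\
			continue;					\
		if (time_check_expr)					\
			break;						\
	}								\
	(typeof(*ptr))VAL;						\
})
#endif

/*
 * The acquire variant can then be layered on the relaxed one, mirroring
 * how smp_cond_load_acquire() is built in asm-generic/barrier.h.
 */
#ifndef smp_cond_load_acquire_timewait
#define smp_cond_load_acquire_timewait(ptr, cond_expr, time_check_expr) \
({									\
	__unqual_scalar_typeof(*ptr) _val;				\
	_val = smp_cond_load_relaxed_timewait(ptr, cond_expr,		\
					      time_check_expr);		\
	smp_acquire__after_ctrl_dep();	/* ctrl dependency + acquire */	\
	(typeof(*ptr))_val;						\
})
#endif

On arm64, per the cover letter, the per-arch override in
arch/arm64/include/asm/barrier.h can wait on the cacheline instead of
spinning, which is the point of making the primitive overridable.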
On Fri, Aug 29, 2025 at 01:07:30AM -0700, Ankur Arora wrote:
> Ankur Arora (5):
>   asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
>   arm64: barrier: Add smp_cond_load_relaxed_timewait()
>   arm64: rqspinlock: Remove private copy of
>     smp_cond_load_acquire_timewait
>   asm-generic: barrier: Add smp_cond_load_acquire_timewait()
>   rqspinlock: use smp_cond_load_acquire_timewait()

Can you have a go at poll_idle() to see what it would look like using
this API? It doesn't necessarily mean we have to merge them all at once
but it gives us a better idea of the suitability of the interface.

--
Catalin
On Mon, 2025-09-01 at 12:49 +0100, Catalin Marinas wrote:
> On Fri, Aug 29, 2025 at 01:07:30AM -0700, Ankur Arora wrote:
> > Ankur Arora (5):
> >   asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
> >   arm64: barrier: Add smp_cond_load_relaxed_timewait()
> >   arm64: rqspinlock: Remove private copy of
> >     smp_cond_load_acquire_timewait
> >   asm-generic: barrier: Add smp_cond_load_acquire_timewait()
> >   rqspinlock: use smp_cond_load_acquire_timewait()
>
> Can you have a go at poll_idle() to see what it would look like using
> this API? It doesn't necessarily mean we have to merge them all at once
> but it gives us a better idea of the suitability of the interface.

This is what I tested on ARM on Fri:
https://github.com/harisokanovic/linux/blob/37e02b950c99370466e7385e5e754bbd6232ef95/drivers/cpuidle/poll_state.c#L24

Regards,
Haris Okanovic
AWS Graviton Software

> --
> Catalin
Catalin Marinas <catalin.marinas@arm.com> writes:

> On Fri, Aug 29, 2025 at 01:07:30AM -0700, Ankur Arora wrote:
>> Ankur Arora (5):
>>   asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
>>   arm64: barrier: Add smp_cond_load_relaxed_timewait()
>>   arm64: rqspinlock: Remove private copy of
>>     smp_cond_load_acquire_timewait
>>   asm-generic: barrier: Add smp_cond_load_acquire_timewait()
>>   rqspinlock: use smp_cond_load_acquire_timewait()
>
> Can you have a go at poll_idle() to see what it would look like using
> this API? It doesn't necessarily mean we have to merge them all at once
> but it gives us a better idea of the suitability of the interface.

So, I've been testing with some version of the following:

diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
index 9b6d90a72601..361879396d0c 100644
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -8,35 +8,25 @@
 #include <linux/sched/clock.h>
 #include <linux/sched/idle.h>

-#define POLL_IDLE_RELAX_COUNT	200
-
 static int __cpuidle poll_idle(struct cpuidle_device *dev,
 			       struct cpuidle_driver *drv, int index)
 {
-	u64 time_start;
-
-	time_start = local_clock_noinstr();
+	unsigned long flags;

 	dev->poll_time_limit = false;

 	raw_local_irq_enable();
 	if (!current_set_polling_and_test()) {
-		unsigned int loop_count = 0;
-		u64 limit;
+		u64 limit, time_end;

 		limit = cpuidle_poll_time(drv, dev);
+		time_end = local_clock_noinstr() + limit;

-		while (!need_resched()) {
-			cpu_relax();
-			if (loop_count++ < POLL_IDLE_RELAX_COUNT)
-				continue;
+		flags = smp_cond_load_relaxed_timewait(&current_thread_info()->flags,
+						       VAL & _TIF_NEED_RESCHED,
+						       (local_clock_noinstr() >= time_end));

-			loop_count = 0;
-			if (local_clock_noinstr() - time_start > limit) {
-				dev->poll_time_limit = true;
-				break;
-			}
-		}
+		dev->poll_time_limit = (local_clock_noinstr() >= time_end);
 	}
 	raw_local_irq_disable();

With that, poll_idle() is:

static int __cpuidle poll_idle(struct cpuidle_device *dev,
			       struct cpuidle_driver *drv, int index)
{
	unsigned long flags;

	dev->poll_time_limit = false;

	raw_local_irq_enable();
	if (!current_set_polling_and_test()) {
		u64 limit, time_end;

		limit = cpuidle_poll_time(drv, dev);
		time_end = local_clock_noinstr() + limit;

		flags = smp_cond_load_relaxed_timewait(&current_thread_info()->flags,
						       VAL & _TIF_NEED_RESCHED,
						       (local_clock_noinstr() >= time_end));

		dev->poll_time_limit = (local_clock_noinstr() >= time_end);
	}
	raw_local_irq_disable();

	current_clr_polling();

	return index;
}

--
ankur
On Tue, Sep 02, 2025 at 03:46:52PM -0700, Ankur Arora wrote:
> Catalin Marinas <catalin.marinas@arm.com> writes:
> > Can you have a go at poll_idle() to see what it would look like using
> > this API? It doesn't necessarily mean we have to merge them all at once
> > but it gives us a better idea of the suitability of the interface.
>
> So, I've been testing with some version of the following:
>
> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> index 9b6d90a72601..361879396d0c 100644
> --- a/drivers/cpuidle/poll_state.c
> +++ b/drivers/cpuidle/poll_state.c
> @@ -8,35 +8,25 @@
>  #include <linux/sched/clock.h>
>  #include <linux/sched/idle.h>
>
> -#define POLL_IDLE_RELAX_COUNT	200
> -
>  static int __cpuidle poll_idle(struct cpuidle_device *dev,
>  			       struct cpuidle_driver *drv, int index)
>  {
> -	u64 time_start;
> -
> -	time_start = local_clock_noinstr();
> +	unsigned long flags;
>
>  	dev->poll_time_limit = false;
>
>  	raw_local_irq_enable();
>  	if (!current_set_polling_and_test()) {
> -		unsigned int loop_count = 0;
> -		u64 limit;
> +		u64 limit, time_end;
>
>  		limit = cpuidle_poll_time(drv, dev);
> +		time_end = local_clock_noinstr() + limit;
>
> -		while (!need_resched()) {
> -			cpu_relax();
> -			if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> -				continue;
> +		flags = smp_cond_load_relaxed_timewait(&current_thread_info()->flags,
> +						       VAL & _TIF_NEED_RESCHED,
> +						       (local_clock_noinstr() >= time_end));

It makes sense to have the non-strict comparison, though it changes the
original behaviour slightly. Just mention it in the commit log.

> -			loop_count = 0;
> -			if (local_clock_noinstr() - time_start > limit) {
> -				dev->poll_time_limit = true;
> -				break;
> -			}
> -		}
> +		dev->poll_time_limit = (local_clock_noinstr() >= time_end);

Could we do this instead and avoid another clock read:

	dev->poll_time_limit = !(flags & _TIF_NEED_RESCHED);

In the original code, it made sense since it had to check the clock
anyway and break the loop.

When you repost, please include the rqspinlock and poll_idle changes as
well to show how the interface is used.

--
Catalin
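Spelled out, the tail of the if-block with Catalin's suggestion folded
in would read roughly as follows (a sketch combining the diff above
with the proposed assignment; untested):

		flags = smp_cond_load_relaxed_timewait(&current_thread_info()->flags,
						       VAL & _TIF_NEED_RESCHED,
						       (local_clock_noinstr() >= time_end));

		/*
		 * If the load returned without TIF_NEED_RESCHED set, the
		 * only other way out of the wait was the time check, so
		 * the limit must have been hit -- no second clock read.
		 */
		dev->poll_time_limit = !(flags & _TIF_NEED_RESCHED);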
Catalin Marinas <catalin.marinas@arm.com> writes:

> On Tue, Sep 02, 2025 at 03:46:52PM -0700, Ankur Arora wrote:
>> Catalin Marinas <catalin.marinas@arm.com> writes:
>> > Can you have a go at poll_idle() to see what it would look like using
>> > this API? It doesn't necessarily mean we have to merge them all at once
>> > but it gives us a better idea of the suitability of the interface.
>>
>> So, I've been testing with some version of the following:
>>
[...]
>> +		flags = smp_cond_load_relaxed_timewait(&current_thread_info()->flags,
>> +						       VAL & _TIF_NEED_RESCHED,
>> +						       (local_clock_noinstr() >= time_end));
>
> It makes sense to have the non-strict comparison, though it changes the
> original behaviour slightly. Just mention it in the commit log.
>
>> +		dev->poll_time_limit = (local_clock_noinstr() >= time_end);
>
> Could we do this instead and avoid another clock read:
>
> 	dev->poll_time_limit = !(flags & _TIF_NEED_RESCHED);
>
> In the original code, it made sense since it had to check the clock
> anyway and break the loop.
>
> When you repost, please include the rqspinlock and poll_idle changes as
> well to show how the interface is used.

Sure.

--
ankur
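For the rqspinlock side that Catalin asks to see in the repost, the
conversion would presumably take a shape like the following
(hypothetical sketch modelled on the cover letter's description and
the naming in kernel/bpf/rqspinlock.c; the check_timeout() signature
and the unwind label are assumptions, not the actual patch):

	int ret = 0;

	/*
	 * Wait for the lock byte to clear, with acquire semantics on
	 * success; bail out when the deadlock/timeout check fires.
	 */
	val = smp_cond_load_acquire_timewait(&lock->locked, !VAL,
					     (ret = check_timeout(lock, _Q_LOCKED_MASK, &ts)));
	if (ret)
		goto err_release_entry;	/* hypothetical unwind path */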
On Fri, 2025-08-29 at 01:07 -0700, Ankur Arora wrote:
> Hi,
>
> This series adds waited variants of the smp_cond_load() primitives:
> smp_cond_load_relaxed_timewait(), and smp_cond_load_acquire_timewait().
>
> Why?: as the name suggests, the new interfaces are meant for contexts
> where you want to wait on a condition variable for a finite duration.
> This is easy enough to do with a loop around cpu_relax(). However,
> some architectures (ex. arm64) also allow waiting on a cacheline. So,
> these interfaces handle a mixture of spin/wait with a smp_cond_load()
> thrown in.
>
[...]
>
> Ankur Arora (5):
>   asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
>   arm64: barrier: Add smp_cond_load_relaxed_timewait()
>   arm64: rqspinlock: Remove private copy of
>     smp_cond_load_acquire_timewait
>   asm-generic: barrier: Add smp_cond_load_acquire_timewait()
>   rqspinlock: use smp_cond_load_acquire_timewait()
>
>  arch/arm64/include/asm/barrier.h    | 22 ++++++++
>  arch/arm64/include/asm/rqspinlock.h | 84 +----------------------------
>  include/asm-generic/barrier.h       | 57 ++++++++++++++++++++
>  include/asm-generic/rqspinlock.h    |  4 ++
>  kernel/bpf/rqspinlock.c             | 25 ++++-----
>  5 files changed, 93 insertions(+), 99 deletions(-)
>
> --
> 2.31.1

Tested on AWS Graviton 2, 3, and 4 (ARM64 Neoverse N1, V1, and V2) with
your V10 haltpoll changes, atop 6.17.0-rc3 (commit 07d9df8008). Still
seeing between 1.3x and 2.5x speedups in `perf bench sched pipe` and
`seccomp-notify`; no change in `messaging`.
Reviewed-by: Haris Okanovic <harisokn@amazon.com>
Tested-by: Haris Okanovic <harisokn@amazon.com>

Regards,
Haris Okanovic
AWS Graviton Software
Okanovic, Haris <harisokn@amazon.com> writes:

> On Fri, 2025-08-29 at 01:07 -0700, Ankur Arora wrote:
>> Hi,
>>
>> This series adds waited variants of the smp_cond_load() primitives:
>> smp_cond_load_relaxed_timewait(), and smp_cond_load_acquire_timewait().
>>
[...]
>
> Tested on AWS Graviton 2, 3, and 4 (ARM64 Neoverse N1, V1, and V2) with
> your V10 haltpoll changes, atop 6.17.0-rc3 (commit 07d9df8008). Still
> seeing between 1.3x and 2.5x speedups in `perf bench sched pipe` and
> `seccomp-notify`; no change in `messaging`.

Great.

> Reviewed-by: Haris Okanovic <harisokn@amazon.com>
> Tested-by: Haris Okanovic <harisokn@amazon.com>

Thank you.

--
ankur