[PATCH v4 0/5] barrier: Add smp_cond_load_*_timewait()
Posted by Ankur Arora 1 month ago
Hi,

This series adds waited variants of the smp_cond_load() primitives:
smp_cond_load_relaxed_timewait(), and smp_cond_load_acquire_timewait().

Why?: as the name suggests, the new interfaces are meant for contexts
where you want to wait on a condition variable for a finite duration.
This is easy enough to do with a loop around cpu_relax(). However,
some architectures (ex. arm64) also allow waiting on a cacheline. So,
these interfaces handle a mixture of spin/wait with a smp_cond_load()
thrown in.

There are two known users for these interfaces:

 - poll_idle() [1]
 - resilient queued spinlocks [2]

The interfaces are:
   smp_cond_load_relaxed_timewait(ptr, cond_expr, time_check_expr)
   smp_cond_load_acquire_timewait(ptr, cond_expr, time_check_expr)

The added parameter, time_check_expr, determines the bail out condition.

Changelog:
  v3 [3]:
    - further interface simplifications (suggested by Catalin Marinas)

  v2 [4]:
    - simplified the interface (suggested by Catalin Marinas)
       - get rid of wait_policy, and a multitude of constants
       - adds a slack parameter
      This helped remove a fair amount of code duplication and, in
      hindsight, unnecessary constants.

  v1 [5]:
     - add wait_policy (coarse and fine)
     - derive spin-count etc at runtime instead of using arbitrary
       constants.

Haris Okanovic had tested an earlier version of this series with 
poll_idle()/haltpoll patches. [6]

Any comments appreciated!

Thanks!
Ankur

[1] https://lore.kernel.org/lkml/20241107190818.522639-3-ankur.a.arora@oracle.com/
[2] Uses the smp_cond_load_acquire_timewait() from v1
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/include/asm/rqspinlock.h
[3] https://lore.kernel.org/lkml/20250627044805.945491-1-ankur.a.arora@oracle.com/
[4] https://lore.kernel.org/lkml/20250502085223.1316925-1-ankur.a.arora@oracle.com/
[5] https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/
[6] https://lore.kernel.org/lkml/f2f5d09e79539754ced085ed89865787fa668695.camel@amazon.com

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: linux-arch@vger.kernel.org

Ankur Arora (5):
  asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
  arm64: barrier: Add smp_cond_load_relaxed_timewait()
  arm64: rqspinlock: Remove private copy of
    smp_cond_load_acquire_timewait
  asm-generic: barrier: Add smp_cond_load_acquire_timewait()
  rqspinlock: use smp_cond_load_acquire_timewait()

 arch/arm64/include/asm/barrier.h    | 22 ++++++++
 arch/arm64/include/asm/rqspinlock.h | 84 +----------------------------
 include/asm-generic/barrier.h       | 57 ++++++++++++++++++++
 include/asm-generic/rqspinlock.h    |  4 ++
 kernel/bpf/rqspinlock.c             | 25 ++++-----
 5 files changed, 93 insertions(+), 99 deletions(-)

-- 
2.31.1
Re: [PATCH v4 0/5] barrier: Add smp_cond_load_*_timewait()
Posted by Catalin Marinas 1 month ago
On Fri, Aug 29, 2025 at 01:07:30AM -0700, Ankur Arora wrote:
> Ankur Arora (5):
>   asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
>   arm64: barrier: Add smp_cond_load_relaxed_timewait()
>   arm64: rqspinlock: Remove private copy of
>     smp_cond_load_acquire_timewait
>   asm-generic: barrier: Add smp_cond_load_acquire_timewait()
>   rqspinlock: use smp_cond_load_acquire_timewait()

Can you have a go at poll_idle() to see how it would look using
this API? It doesn't necessarily mean we have to merge them all at once
but it gives us a better idea of the suitability of the interface.

-- 
Catalin
Re: [PATCH v4 0/5] barrier: Add smp_cond_load_*_timewait()
Posted by Okanovic, Haris 4 weeks, 1 day ago
On Mon, 2025-09-01 at 12:49 +0100, Catalin Marinas wrote:
> On Fri, Aug 29, 2025 at 01:07:30AM -0700, Ankur Arora wrote:
> > Ankur Arora (5):
> >   asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
> >   arm64: barrier: Add smp_cond_load_relaxed_timewait()
> >   arm64: rqspinlock: Remove private copy of
> >     smp_cond_load_acquire_timewait
> >   asm-generic: barrier: Add smp_cond_load_acquire_timewait()
> >   rqspinlock: use smp_cond_load_acquire_timewait()
> 
> Can you have a go at poll_idle() to see how it would look using
> this API? It doesn't necessarily mean we have to merge them all at once
> but it gives us a better idea of the suitability of the interface.

This is what I tested on ARM on Fri:

https://github.com/harisokanovic/linux/blob/37e02b950c99370466e7385e5e754bbd6232ef95/drivers/cpuidle/poll_state.c#L24

Regards,
Haris Okanovic
AWS Graviton Software

> 
> --
> Catalin

Re: [PATCH v4 0/5] barrier: Add smp_cond_load_*_timewait()
Posted by Ankur Arora 1 month ago
Catalin Marinas <catalin.marinas@arm.com> writes:

> On Fri, Aug 29, 2025 at 01:07:30AM -0700, Ankur Arora wrote:
>> Ankur Arora (5):
>>   asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
>>   arm64: barrier: Add smp_cond_load_relaxed_timewait()
>>   arm64: rqspinlock: Remove private copy of
>>     smp_cond_load_acquire_timewait
>>   asm-generic: barrier: Add smp_cond_load_acquire_timewait()
>>   rqspinlock: use smp_cond_load_acquire_timewait()
>
> Can you have a go at poll_idle() to see how it would look using
> this API? It doesn't necessarily mean we have to merge them all at once
> but it gives us a better idea of the suitability of the interface.

So, I've been testing with some version of the following:

diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
index 9b6d90a72601..361879396d0c 100644
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -8,35 +8,25 @@
 #include <linux/sched/clock.h>
 #include <linux/sched/idle.h>

-#define POLL_IDLE_RELAX_COUNT	200
-
 static int __cpuidle poll_idle(struct cpuidle_device *dev,
 			       struct cpuidle_driver *drv, int index)
 {
-	u64 time_start;
-
-	time_start = local_clock_noinstr();
+	unsigned long flags;

 	dev->poll_time_limit = false;

 	raw_local_irq_enable();
 	if (!current_set_polling_and_test()) {
-		unsigned int loop_count = 0;
-		u64 limit;
+		u64 limit, time_end;

 		limit = cpuidle_poll_time(drv, dev);
+		time_end = local_clock_noinstr() + limit;

-		while (!need_resched()) {
-			cpu_relax();
-			if (loop_count++ < POLL_IDLE_RELAX_COUNT)
-				continue;
+		flags = smp_cond_load_relaxed_timewait(&current_thread_info()->flags,
+						       VAL & _TIF_NEED_RESCHED,
+						       (local_clock_noinstr() >= time_end));

-			loop_count = 0;
-			if (local_clock_noinstr() - time_start > limit) {
-				dev->poll_time_limit = true;
-				break;
-			}
-		}
+		dev->poll_time_limit = (local_clock_noinstr() >= time_end);
 	}
 	raw_local_irq_disable();

With that, poll_idle() is:

  static int __cpuidle poll_idle(struct cpuidle_device *dev,
			       struct cpuidle_driver *drv, int index)
  {
	unsigned long flags;

	dev->poll_time_limit = false;

	raw_local_irq_enable();
	if (!current_set_polling_and_test()) {
		u64 limit, time_end;

		limit = cpuidle_poll_time(drv, dev);
		time_end = local_clock_noinstr() + limit;

		flags = smp_cond_load_relaxed_timewait(&current_thread_info()->flags,
						       VAL & _TIF_NEED_RESCHED,
						       (local_clock_noinstr() >= time_end));

		dev->poll_time_limit = (local_clock_noinstr() >= time_end);
	}
	raw_local_irq_disable();

	current_clr_polling();

	return index;
  }

--
ankur
Re: [PATCH v4 0/5] barrier: Add smp_cond_load_*_timewait()
Posted by Catalin Marinas 1 month ago
On Tue, Sep 02, 2025 at 03:46:52PM -0700, Ankur Arora wrote:
> Catalin Marinas <catalin.marinas@arm.com> writes:
> > Can you have a go at poll_idle() to see how it would look using
> > this API? It doesn't necessarily mean we have to merge them all at once
> > but it gives us a better idea of the suitability of the interface.
> 
> So, I've been testing with some version of the following:
> 
> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
> index 9b6d90a72601..361879396d0c 100644
> --- a/drivers/cpuidle/poll_state.c
> +++ b/drivers/cpuidle/poll_state.c
> @@ -8,35 +8,25 @@
>  #include <linux/sched/clock.h>
>  #include <linux/sched/idle.h>
> 
> -#define POLL_IDLE_RELAX_COUNT	200
> -
>  static int __cpuidle poll_idle(struct cpuidle_device *dev,
>  			       struct cpuidle_driver *drv, int index)
>  {
> -	u64 time_start;
> -
> -	time_start = local_clock_noinstr();
> +	unsigned long flags;
> 
>  	dev->poll_time_limit = false;
> 
>  	raw_local_irq_enable();
>  	if (!current_set_polling_and_test()) {
> -		unsigned int loop_count = 0;
> -		u64 limit;
> +		u64 limit, time_end;
> 
>  		limit = cpuidle_poll_time(drv, dev);
> +		time_end = local_clock_noinstr() + limit;
> 
> -		while (!need_resched()) {
> -			cpu_relax();
> -			if (loop_count++ < POLL_IDLE_RELAX_COUNT)
> -				continue;
> +		flags = smp_cond_load_relaxed_timewait(&current_thread_info()->flags,
> +						       VAL & _TIF_NEED_RESCHED,
> +						       (local_clock_noinstr() >= time_end));

It makes sense to have the non-strict comparison, though it changes the
original behaviour slightly. Just mention it in the commit log.

> 
> -			loop_count = 0;
> -			if (local_clock_noinstr() - time_start > limit) {
> -				dev->poll_time_limit = true;
> -				break;
> -			}
> -		}
> +		dev->poll_time_limit = (local_clock_noinstr() >= time_end);

Could we do this instead and avoid another clock read:

		dev->poll_time_limit = !(flags & _TIF_NEED_RESCHED);

In the original code, it made sense since it had to check the clock
anyway and break the loop.

When you repost, please include the rqspinlock and poll_idle changes as
well to show how the interface is used.

-- 
Catalin
Re: [PATCH v4 0/5] barrier: Add smp_cond_load_*_timewait()
Posted by Ankur Arora 4 weeks, 1 day ago
Catalin Marinas <catalin.marinas@arm.com> writes:

> On Tue, Sep 02, 2025 at 03:46:52PM -0700, Ankur Arora wrote:
>> Catalin Marinas <catalin.marinas@arm.com> writes:
>> > Can you have a go at poll_idle() to see how it would look using
>> > this API? It doesn't necessarily mean we have to merge them all at once
>> > but it gives us a better idea of the suitability of the interface.
>>
>> So, I've been testing with some version of the following:
>>
>> diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
>> index 9b6d90a72601..361879396d0c 100644
>> --- a/drivers/cpuidle/poll_state.c
>> +++ b/drivers/cpuidle/poll_state.c
>> @@ -8,35 +8,25 @@
>>  #include <linux/sched/clock.h>
>>  #include <linux/sched/idle.h>
>>
>> -#define POLL_IDLE_RELAX_COUNT	200
>> -
>>  static int __cpuidle poll_idle(struct cpuidle_device *dev,
>>  			       struct cpuidle_driver *drv, int index)
>>  {
>> -	u64 time_start;
>> -
>> -	time_start = local_clock_noinstr();
>> +	unsigned long flags;
>>
>>  	dev->poll_time_limit = false;
>>
>>  	raw_local_irq_enable();
>>  	if (!current_set_polling_and_test()) {
>> -		unsigned int loop_count = 0;
>> -		u64 limit;
>> +		u64 limit, time_end;
>>
>>  		limit = cpuidle_poll_time(drv, dev);
>> +		time_end = local_clock_noinstr() + limit;
>>
>> -		while (!need_resched()) {
>> -			cpu_relax();
>> -			if (loop_count++ < POLL_IDLE_RELAX_COUNT)
>> -				continue;
>> +		flags = smp_cond_load_relaxed_timewait(&current_thread_info()->flags,
>> +						       VAL & _TIF_NEED_RESCHED,
>> +						       (local_clock_noinstr() >= time_end));
>
> It makes sense to have the non-strict comparison, though it changes the
> original behaviour slightly. Just mention it in the commit log.
>
>>
>> -			loop_count = 0;
>> -			if (local_clock_noinstr() - time_start > limit) {
>> -				dev->poll_time_limit = true;
>> -				break;
>> -			}
>> -		}
>> +		dev->poll_time_limit = (local_clock_noinstr() >= time_end);
>
> Could we do this instead and avoid another clock read:
>
> 		dev->poll_time_limit = !(flags & _TIF_NEED_RESCHED);
>
> In the original code, it made sense since it had to check the clock
> anyway and break the loop.
>
> When you repost, please include the rqspinlock and poll_idle changes as
> well to show how the interface is used.

Sure.

--
ankur
Re: [PATCH v4 0/5] barrier: Add smp_cond_load_*_timewait()
Posted by Okanovic, Haris 1 month ago
On Fri, 2025-08-29 at 01:07 -0700, Ankur Arora wrote:
> Hi,
> 
> This series adds waited variants of the smp_cond_load() primitives:
> smp_cond_load_relaxed_timewait(), and smp_cond_load_acquire_timewait().
> 
> Why?: as the name suggests, the new interfaces are meant for contexts
> where you want to wait on a condition variable for a finite duration.
> This is easy enough to do with a loop around cpu_relax(). However,
> some architectures (ex. arm64) also allow waiting on a cacheline. So,
> these interfaces handle a mixture of spin/wait with a smp_cond_load()
> thrown in.
> 
> There are two known users for these interfaces:
> 
>  - poll_idle() [1]
>  - resilient queued spinlocks [2]
> 
> The interfaces are:
>    smp_cond_load_relaxed_timewait(ptr, cond_expr, time_check_expr)
>    smp_cond_load_acquire_timewait(ptr, cond_expr, time_check_expr)
> 
> The added parameter, time_check_expr, determines the bail out condition.
> 
> Changelog:
>   v3 [3]:
>     - further interface simplifications (suggested by Catalin Marinas)
> 
>   v2 [4]:
>     - simplified the interface (suggested by Catalin Marinas)
>        - get rid of wait_policy, and a multitude of constants
>        - adds a slack parameter
>       This helped remove a fair amount of code duplication and, in
>       hindsight, unnecessary constants.
> 
>   v1 [5]:
>      - add wait_policy (coarse and fine)
>      - derive spin-count etc at runtime instead of using arbitrary
>        constants.
> 
> Haris Okanovic had tested an earlier version of this series with
> poll_idle()/haltpoll patches. [6]
> 
> Any comments appreciated!
> 
> Thanks!
> Ankur
> 
> [1] https://lore.kernel.org/lkml/20241107190818.522639-3-ankur.a.arora@oracle.com/
> [2] Uses the smp_cond_load_acquire_timewait() from v1
>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/include/asm/rqspinlock.h
> [3] https://lore.kernel.org/lkml/20250627044805.945491-1-ankur.a.arora@oracle.com/
> [4] https://lore.kernel.org/lkml/20250502085223.1316925-1-ankur.a.arora@oracle.com/
> [5] https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/
> [6] https://lore.kernel.org/lkml/f2f5d09e79539754ced085ed89865787fa668695.camel@amazon.com
> 
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: linux-arch@vger.kernel.org
> 
> Ankur Arora (5):
>   asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
>   arm64: barrier: Add smp_cond_load_relaxed_timewait()
>   arm64: rqspinlock: Remove private copy of
>     smp_cond_load_acquire_timewait
>   asm-generic: barrier: Add smp_cond_load_acquire_timewait()
>   rqspinlock: use smp_cond_load_acquire_timewait()
> 
>  arch/arm64/include/asm/barrier.h    | 22 ++++++++
>  arch/arm64/include/asm/rqspinlock.h | 84 +----------------------------
>  include/asm-generic/barrier.h       | 57 ++++++++++++++++++++
>  include/asm-generic/rqspinlock.h    |  4 ++
>  kernel/bpf/rqspinlock.c             | 25 ++++-----
>  5 files changed, 93 insertions(+), 99 deletions(-)
> 
> --
> 2.31.1
> 

Tested on AWS Graviton 2, 3, and 4 (ARM64 Neoverse N1, V1, and V2) with
your V10 haltpoll changes, atop 6.17.0-rc3 (commit 07d9df8008).
Still seeing between 1.3x and 2.5x speedups in `perf bench sched pipe`
and `seccomp-notify`; no change in `messaging`.

Reviewed-by: Haris Okanovic <harisokn@amazon.com>
Tested-by: Haris Okanovic <harisokn@amazon.com>

Regards,
Haris Okanovic
AWS Graviton Software

Re: [PATCH v4 0/5] barrier: Add smp_cond_load_*_timewait()
Posted by Ankur Arora 1 month ago
Okanovic, Haris <harisokn@amazon.com> writes:

> On Fri, 2025-08-29 at 01:07 -0700, Ankur Arora wrote:
>> Hi,
>>
>> This series adds waited variants of the smp_cond_load() primitives:
>> smp_cond_load_relaxed_timewait(), and smp_cond_load_acquire_timewait().
>>
>> Why?: as the name suggests, the new interfaces are meant for contexts
>> where you want to wait on a condition variable for a finite duration.
>> This is easy enough to do with a loop around cpu_relax(). However,
>> some architectures (ex. arm64) also allow waiting on a cacheline. So,
>> these interfaces handle a mixture of spin/wait with a smp_cond_load()
>> thrown in.
>>
>> There are two known users for these interfaces:
>>
>>  - poll_idle() [1]
>>  - resilient queued spinlocks [2]
>>
>> The interfaces are:
>>    smp_cond_load_relaxed_timewait(ptr, cond_expr, time_check_expr)
>>    smp_cond_load_acquire_timewait(ptr, cond_expr, time_check_expr)
>>
>> The added parameter, time_check_expr, determines the bail out condition.
>>
>> Changelog:
>>   v3 [3]:
>>     - further interface simplifications (suggested by Catalin Marinas)
>>
>>   v2 [4]:
>>     - simplified the interface (suggested by Catalin Marinas)
>>        - get rid of wait_policy, and a multitude of constants
>>        - adds a slack parameter
>>       This helped remove a fair amount of code duplication and, in
>>       hindsight, unnecessary constants.
>>
>>   v1 [5]:
>>      - add wait_policy (coarse and fine)
>>      - derive spin-count etc at runtime instead of using arbitrary
>>        constants.
>>
>> Haris Okanovic had tested an earlier version of this series with
>> poll_idle()/haltpoll patches. [6]
>>
>> Any comments appreciated!
>>
>> Thanks!
>> Ankur
>>
>> [1] https://lore.kernel.org/lkml/20241107190818.522639-3-ankur.a.arora@oracle.com/
>> [2] Uses the smp_cond_load_acquire_timewait() from v1
>>     https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/include/asm/rqspinlock.h
>> [3] https://lore.kernel.org/lkml/20250627044805.945491-1-ankur.a.arora@oracle.com/
>> [4] https://lore.kernel.org/lkml/20250502085223.1316925-1-ankur.a.arora@oracle.com/
>> [5] https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/
>> [6] https://lore.kernel.org/lkml/f2f5d09e79539754ced085ed89865787fa668695.camel@amazon.com
>>
>> Cc: Arnd Bergmann <arnd@arndb.de>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>> Cc: Alexei Starovoitov <ast@kernel.org>
>> Cc: linux-arch@vger.kernel.org
>>
>> Ankur Arora (5):
>>   asm-generic: barrier: Add smp_cond_load_relaxed_timewait()
>>   arm64: barrier: Add smp_cond_load_relaxed_timewait()
>>   arm64: rqspinlock: Remove private copy of
>>     smp_cond_load_acquire_timewait
>>   asm-generic: barrier: Add smp_cond_load_acquire_timewait()
>>   rqspinlock: use smp_cond_load_acquire_timewait()
>>
>>  arch/arm64/include/asm/barrier.h    | 22 ++++++++
>>  arch/arm64/include/asm/rqspinlock.h | 84 +----------------------------
>>  include/asm-generic/barrier.h       | 57 ++++++++++++++++++++
>>  include/asm-generic/rqspinlock.h    |  4 ++
>>  kernel/bpf/rqspinlock.c             | 25 ++++-----
>>  5 files changed, 93 insertions(+), 99 deletions(-)
>>
>> --
>> 2.31.1
>>
>
> Tested on AWS Graviton 2, 3, and 4 (ARM64 Neoverse N1, V1, and V2) with
> your V10 haltpoll changes, atop 6.17.0-rc3 (commit 07d9df8008).
> Still seeing between 1.3x and 2.5x speedups in `perf bench sched pipe`
> and `seccomp-notify`; no change in `messaging`.

Great.

> Reviewed-by: Haris Okanovic <harisokn@amazon.com>
> Tested-by: Haris Okanovic <harisokn@amazon.com>

Thank you.

--
ankur