Hi,
This series adds waited variants of the smp_cond_load() primitives:
smp_cond_load_relaxed_timewait(), and smp_cond_load_acquire_timewait().
There are two known users for these interfaces:
- poll_idle() [1]
- resilient queued spinlocks [2]
For both of these cases we want to wait on a condition but also want
to terminate the wait based on a timeout.
Before describing how v2 implements these interfaces, let me recap the
problems in v1 (Catalin outlined most of these in [3]):
smp_cond_load_relaxed_spinwait(ptr, cond_expr, time_expr_ns, time_limit_ns)
took four arguments, with ptr and cond_expr doing the usual smp_cond_load()
things and time_expr_ns and time_limit_ns being used to decide the
terminating condition.
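For reference, a v1-style caller (a poll_idle()-like user) would look
roughly like this (an illustrative sketch, not the exact patch;
time_limit_ns here is the caller's deadline in ns):

    flags = smp_cond_load_relaxed_spinwait(&current_thread_info()->flags,
                                           VAL & _TIF_NEED_RESCHED,
                                           local_clock(), time_limit_ns);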
There were some problems in the timekeeping:
1. How often do we do the (relatively expensive) time-check?
The choice made was once every 200 spin-wait iterations, with each
iteration trying to idle the pipeline by executing cpu_relax().
The choice of 200 was, of course, arbitrary and somewhat meaningless
across architectures. On recent x86, cpu_relax()/PAUSE takes ~20-30
cycles, but on (non-SMT) arm64 cpu_relax()/YIELD is effectively
just a NOP.
Even if each architecture had its own limit, this would also vary
across microarchitectures.
2. On arm64, which can do better than just cpu_relax(), for instance
by waiting for a store to an address (WFE), the implementation
used WFE exclusively, with the spin-wait used only as a fallback
when the event stream was disabled.
One problem with this was that the worst-case timeout overshoot
with WFE is ARCH_TIMER_EVT_STREAM_PERIOD_US (100us), so there's
a vast gulf between that and the potentially much smaller granularity
of the spin-wait versions. In addition, the interface provided
no way for the caller to specify or limit the overshoot.
Non-timekeeping issues:
3. The interface was useful for poll_idle()-like users but was not
usable if the caller needed to do any work. For instance,
rqspinlock uses it thus:
smp_cond_load_acquire_timewait(v, c, 0, 1)
Here the time-check always evaluates to false and all of the logic
(ex. deadlock checking) is folded into the conditional.
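Concretely, that usage amounts to something like this sketch (simplified;
check_deadlock() stands in for rqspinlock's actual timeout/deadlock helper):

    /*
     * Sketch only: comparing a time_expr of 0 against a time_limit of 1
     * means the time-check never terminates the wait; the deadlock and
     * timeout logic has to ride along inside cond_expr.
     */
    smp_cond_load_acquire_timewait(&lock->locked,
                                   !VAL || check_deadlock(lock, &ts),
                                   0, 1);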
With that foundation, the new interface is:
smp_cond_load_relaxed_timewait(ptr, cond_expr, wait_policy,
time_expr, time_end)
The added parameter, wait_policy, provides a mechanism for the caller
to apportion time spent spinning or, where supported, in a wait.
This is somewhat inspired by the queue_poll() mechanism used
with smp_cond_load() in arm-smmu-v3 [4].
It addresses (1) by deciding the time-check granularity based on a
time interval instead of spinning for a fixed number of iterations.
(2) is addressed by the wait_policy allowing for different slack
values. The implemented wait_policy handlers allow for a coarse
or a fine-grained slack. A user-defined wait_policy could choose
its own wait parameter, which would also address (3).
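To make the shape of a call concrete, a caller might look something like
the sketch below (my_wait_policy and its exact signature are hypothetical;
the series itself provides coarse and fine-grained policy handlers):

    /*
     * Hypothetical sketch: wait until *ptr becomes non-zero or until
     * local_clock() reaches time_end, with my_wait_policy apportioning
     * the remaining time between spinning and waiting.
     */
    val = smp_cond_load_relaxed_timewait(ptr, VAL != 0,
                                         my_wait_policy,
                                         local_clock(), time_end);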
With that, patches 1-5 add the generic and arm64 logic:
"asm-generic: barrier: add smp_cond_load_relaxed_timewait()",
"asm-generic: barrier: add wait_policy handlers"
"arm64: barrier: enable waiting in smp_cond_load_relaxed_timewait()"
"arm64: barrier: add coarse wait for smp_cond_load_relaxed_timewait()"
"arm64: barrier: add fine wait for smp_cond_load_relaxed_timewait()".
Patch 6 adds the acquire variant:
"asm-generic: barrier: add smp_cond_load_acquire_timewait()"
Finally, patch 7 lays out how this could be used for rqspinlock:
"bpf: rqspinlock: add rqspinlock policy handler for arm64".
Any comments appreciated!
Ankur
[1] https://lore.kernel.org/lkml/20241107190818.522639-3-ankur.a.arora@oracle.com/
[2] Uses the smp_cond_load_acquire_timewait() from v1
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/include/asm/rqspinlock.h
[3] https://lore.kernel.org/lkml/Z8dRalfxYcJIcLGj@arm.com/
[4] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c#n223
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: linux-arch@vger.kernel.org
Ankur Arora (7):
asm-generic: barrier: add smp_cond_load_relaxed_timewait()
asm-generic: barrier: add wait_policy handlers
arm64: barrier: enable waiting in smp_cond_load_relaxed_timewait()
arm64: barrier: add coarse wait for smp_cond_load_relaxed_timewait()
arm64: barrier: add fine wait for smp_cond_load_relaxed_timewait()
asm-generic: barrier: add smp_cond_load_acquire_timewait()
bpf: rqspinlock: add rqspinlock policy handler for arm64
arch/arm64/include/asm/barrier.h | 82 +++++++++++++++
arch/arm64/include/asm/rqspinlock.h | 96 ++++--------------
include/asm-generic/barrier.h | 150 ++++++++++++++++++++++++++++
3 files changed, 251 insertions(+), 77 deletions(-)
--
2.43.5
On Fri, 2025-05-02 at 01:52 -0700, Ankur Arora wrote:
> [...]

Tested on AWS Graviton (ARM64 Neoverse V1) with your V10 haltpoll
changes, atop master 83a896549f.

Reviewed-by: Haris Okanovic <harisokn@amazon.com>
Tested-by: Haris Okanovic <harisokn@amazon.com>

Regards,
Haris Okanovic
AWS Graviton Software
Okanovic, Haris <harisokn@amazon.com> writes:

> On Fri, 2025-05-02 at 01:52 -0700, Ankur Arora wrote:
>> [...]
>
> Tested on AWS Graviton (ARM64 Neoverse V1) with your V10 haltpoll
> changes, atop master 83a896549f.
>
> Reviewed-by: Haris Okanovic <harisokn@amazon.com>
> Tested-by: Haris Okanovic <harisokn@amazon.com>

Thanks for the review (and the testing)!

--
ankur
On Fri, 2 May 2025, Ankur Arora wrote:

> smp_cond_load_relaxed_spinwait(ptr, cond_expr, time_expr_ns, time_limit_ns)
> took four arguments, with ptr and cond_expr doing the usual smp_cond_load()
> things and time_expr_ns and time_limit_ns being used to decide the
> terminating condition.
>
> There were some problems in the timekeeping:
>
> 1. How often do we do the (relatively expensive) time-check?

Is this really important? We have instructions that wait on an event and
terminate at cycle counter values, like WFET on arm64.

The case where we need to perform time checks only arises if the
processor does not support WFET but must use an event stream, or does not
even have that available.

So the best approach is to have a simple interface where we specify the
cycle count when the wait is to be terminated and where we can cover that
with one WFET instruction.

The other cases then are degenerate forms of that. If only WFE is
available then only use that if the timeout is larger than the event
stream granularity. Or if both are not available then do the relax /
loop thing.

So the interface could be much simpler:

__smp_cond_load_relaxed_wait(ptr, timeout_cycle_count)

with a wrapper

smp_cond_relaxed_wait_expr(ptr, expr, timeout_cycle_count)

where we check the expression too and retry if the expression is not true.

The fallbacks with the spins and relax logic would be architecture
specific.
Christoph Lameter (Ampere) <cl@gentwo.org> writes:
> On Fri, 2 May 2025, Ankur Arora wrote:
>
>> smp_cond_load_relaxed_spinwait(ptr, cond_expr, time_expr_ns, time_limit_ns)
>> took four arguments, with ptr and cond_expr doing the usual smp_cond_load()
>> things and time_expr_ns and time_limit_ns being used to decide the
>> terminating condition.
>>
>> There were some problems in the timekeeping:
>>
>> 1. How often do we do the (relatively expensive) time-check?
>
> Is this really important? We have instructions that wait on an event and
> terminate at cycle counter values, like WFET on arm64.
>
> The case where we need to perform time checks only arises if the
> processor does not support WFET but must use an event stream, or does
> not even have that available.
AFAICT the vast majority of arm64 processors in the wild don't yet
support WFET. For instance I haven't been able to find a single one
to test my WFET changes with ;).
The other part is that this needs to be in common code and x86 primarily
uses PAUSE.
So, supporting both configurations: WFE + spin on arm64 and PAUSE on x86
needs a way of rate-limiting the time-check. Otherwise you run into
issues like this one:
commit 4dc2375c1a4e88ed2701f6961e0e4f9a7696ad3c
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date: Tue Mar 27 23:58:45 2018 +0200
cpuidle: poll_state: Avoid invoking local_clock() too often
Rik reports that he sees an increase in CPU use in one benchmark
due to commit 612f1a22f067 "cpuidle: poll_state: Add time limit to
poll_idle()" that caused poll_idle() to call local_clock() in every
iteration of the loop. Utilization increase generally means more
non-idle time with respect to total CPU time (on the average) which
implies reduced CPU frequency.
Doug reports that limiting the rate of local_clock() invocations
in there causes much less power to be drawn during a CPU-intensive
parallel workload (with idle states 1 and 2 disabled to enforce more
state 0 residency).
These two reports together suggest that executing local_clock() on
multiple CPUs in parallel at a high rate may cause chips to get hot
and trigger thermal/power limits on them to kick in, so reduce the
rate of local_clock() invocations in poll_idle() to avoid that issue.
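The rate-limited check that commit introduced looks roughly like this
(simplified from drivers/cpuidle/poll_state.c):

    while (!need_resched()) {
            cpu_relax();
            if (loop_count++ < POLL_IDLE_RELAX_COUNT)
                    continue;

            loop_count = 0;
            /* consult the clock only once every POLL_IDLE_RELAX_COUNT spins */
            if (local_clock() - time_start > limit) {
                    dev->poll_time_limit = true;
                    break;
            }
    }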
> So the best approach is to have a simple interface where we specify the
> cycle count when the wait is to be terminated and where we can cover that
> with one WFET instruction.
>
> The other cases then are degenerate forms of that. If only WFE is
> available then only use that if the timeout is larger than the event
> stream granularity. Or if both are not available then do the relax /
> loop thing.
That's what I had proposed for v1. But as Catalin pointed out, that's
not very useful when the caller wants to limit the overshoot.
This version tries to optimistically use WFE where possible while
minimizing the spin time.
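Roughly, the intent is something along these lines (a sketch of the idea,
not the actual implementation; evt_stream_period_ns and read_time() are
placeholders):

    while (!cond && now < end) {
            if (end - now > evt_stream_period_ns)
                    wfe();          /* coarse: woken by the event stream, ~100us */
            else
                    cpu_relax();    /* fine: spin out the tail to bound overshoot */
            now = read_time();
    }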
> So the interface could be much simpler:
>
> __smp_cond_load_relaxed_wait(ptr, timeout_cycle_count)
>
> with a wrapper
>
> smp_cond_relaxed_wait_expr(ptr, expr, timeout_cycle_count)
Oh, I would have absolutely liked to keep the interface simple, but
couldn't see a way to do that while managing the other constraints.
For instance, different users want different clocks: poll_idle() can make
do with an imprecise clock but rqspinlock needs ktime_get_mono().
I also think it's better to use standard clock types rather than
arm64-specific cycle counts or TSC values.
> where we check the expression too and retry if the expression is not true.
>
> The fallbacks with the spins and relax logic would be architecture
> specific.
Even if they were all architecture-specific, I suspect there's quite a
lot of variation in cpu_relax() behaviour across microarchitectures. For
instance, YIELD is a NOP on non-SMT arm64, but probably heavier on
SMT arm64.
Thanks for the quick review!
Ankur
On Fri, 2 May 2025, Ankur Arora wrote:

> AFAICT the vast majority of arm64 processors in the wild don't yet
> support WFET. For instance I haven't been able to find a single one
> to test my WFET changes with ;).

Ok then for patches 1-6:

Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org>
Christoph Lameter (Ampere) <cl@gentwo.org> writes:

> On Fri, 2 May 2025, Ankur Arora wrote:
>
>> AFAICT the vast majority of arm64 processors in the wild don't yet
>> support WFET. For instance I haven't been able to find a single one
>> to test my WFET changes with ;).
>
> Ok then for patches 1-6:
>
> Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org>

Great. Thanks Christoph!

--
ankur