This patchset enables the cpuidle-haltpoll driver and its namesake
governor on arm64. This is specifically interesting for KVM guests by
reducing IPC latencies.
Comparing idle switching latencies on an arm64 KVM guest with
perf bench sched pipe:
usecs/op %stdev
no haltpoll (baseline) 13.48 +- 5.19%
with haltpoll 6.84 +- 22.07%
No change in performance for a similar test on x86:
usecs/op %stdev
haltpoll w/ cpu_relax() (baseline) 4.75 +- 1.76%
haltpoll w/ smp_cond_load_relaxed() 4.78 +- 2.31%
Both sets of tests were on otherwise idle systems with guest VCPUs
pinned to specific PCPUs. One reason for the higher stdev on arm64
is that trapping of the WFE instruction by the host KVM is contingent
on the number of tasks on the runqueue.
Tomohiro Misono and Haris Okanovic also report similar latency
improvements on Grace and Graviton systems (for v7) [1] [2].
The patch series is organized in three parts:
- patch 1 reorganizes the poll_idle() loop, switching to
  smp_cond_load_relaxed() in the polling loop (see the sketch after
  this list).
  Relatedly, patches 2 and 3 rename the config option ARCH_HAS_CPU_RELAX
  to ARCH_HAS_OPTIMIZED_POLL.
- patches 4-6 reorganize the haltpoll selection and init logic
to allow architecture code to select it.
- and finally, patches 7-11 add the bits for arm64 support.
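For reference, a rough sketch of what the reworked polling loop ends up
looking like (simplified here, not the verbatim patch; the names used,
such as POLL_IDLE_RELAX_COUNT and smp_cond_load_relaxed(), are the ones
in drivers/cpuidle/poll_state.c and asm-generic/barrier.h). The point is
that the inner wait is delegated to smp_cond_load_relaxed(), so an
architecture with an optimized implementation (WFE on arm64) can wait
for TIF_NEED_RESCHED instead of spinning on cpu_relax():

```c
/*
 * Sketch of the reworked poll_idle() (simplified; not the verbatim patch).
 * smp_cond_load_relaxed() waits for TIF_NEED_RESCHED, re-checking the
 * halt_poll_ns time limit every POLL_IDLE_RELAX_COUNT iterations.
 */
static int __cpuidle poll_idle(struct cpuidle_device *dev,
                               struct cpuidle_driver *drv, int index)
{
        u64 time_start = local_clock_noinstr();

        dev->poll_time_limit = false;

        raw_local_irq_enable();
        if (!current_set_polling_and_test()) {
                u64 limit = cpuidle_poll_time(drv, dev);
                unsigned int loop_count = 0;
                unsigned long flags;

                for (;;) {
                        /*
                         * Wait until need-resched is set, or until we
                         * have "relaxed" POLL_IDLE_RELAX_COUNT times;
                         * then recheck the time limit.
                         */
                        flags = smp_cond_load_relaxed(&current_thread_info()->flags,
                                                      VAL & _TIF_NEED_RESCHED ||
                                                      loop_count++ >= POLL_IDLE_RELAX_COUNT);
                        if (flags & _TIF_NEED_RESCHED)
                                break;

                        loop_count = 0;
                        if (local_clock_noinstr() - time_start > limit) {
                                dev->poll_time_limit = true;
                                break;
                        }
                }
        }
        raw_local_irq_disable();

        current_clr_polling();

        return index;
}
```

On arm64 the WFE inside smp_cond_load_relaxed() only wakes on a write to
the flags word or on an event, which is why (per the v5 changelog below)
the arm64 arch_haltpoll_want() depends on the event stream being enabled.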
What is still missing: this series largely completes the haltpoll side
of functionality for arm64. There are, however, a few related areas
that still need to be threshed out:
- WFET support: WFE on arm64 does not guarantee that poll_idle()
would terminate in halt_poll_ns. Using WFET would address this.
- KVM_NO_POLL support on arm64
- KVM TWED support on arm64: allow the host to limit time spent in
WFE.
Changelog:
v8: No logic changes. Largely respin of v7, with changes
noted below:
- move selection of ARCH_HAS_OPTIMIZED_POLL on arm64 to its
own patch.
(patch-9 "arm64: select ARCH_HAS_OPTIMIZED_POLL")
- address comments simplifying arm64 support (Will Deacon)
(patch-11 "arm64: support cpuidle-haltpoll")
v7: No significant logic changes. Mostly a respin of v6.
- minor cleanup in poll_idle() (Christoph Lameter)
- fixes conflicts due to code movement in arch/arm64/kernel/cpuidle.c
(Tomohiro Misono)
v6:
- reordered the patches to keep poll_idle() and ARCH_HAS_OPTIMIZED_POLL
changes together (comment from Christoph Lameter)
- threshes out the commit messages a bit more (comments from Christoph
Lameter, Sudeep Holla)
- also rework selection of cpuidle-haltpoll. Now selected based
on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
- moved back to arch_haltpoll_want() (comment from Joao Martins)
Also, arch_haltpoll_want() now takes the force parameter and is
now responsible for the complete selection (or not) of haltpoll.
- fixes the build breakage on i386
- fixes the cpuidle-haltpoll module breakage on arm64 (comment from
Tomohiro Misono, Haris Okanovic)
v5:
- rework the poll_idle() loop around smp_cond_load_relaxed() (review
comment from Tomohiro Misono.)
- also rework selection of cpuidle-haltpoll. Now selected based
on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
- arch_haltpoll_supported() (renamed from arch_haltpoll_want()) on
arm64 now depends on the event-stream being enabled.
- limit POLL_IDLE_RELAX_COUNT on arm64 (review comment from Haris Okanovic)
- ARCH_HAS_CPU_RELAX is now renamed to ARCH_HAS_OPTIMIZED_POLL.
v4 changes from v3:
- change 7/8 per Rafael input: drop the parens and use ret for the final check
- add 8/8 which renames the guard for building poll_state
v3 changes from v2:
- fix 1/7 per Petr Mladek - remove ARCH_HAS_CPU_RELAX from arch/x86/Kconfig
- add Ack-by from Rafael Wysocki on 2/7
v2 changes from v1:
- added patch 7 where we change cpu_relax with smp_cond_load_relaxed per PeterZ
(this improves by 50% at least the CPU cycles consumed in the tests above:
10,716,881,137 now vs 14,503,014,257 before)
- removed the ifdef from patch 1 per RafaelW
Please review.
[1] https://lore.kernel.org/lkml/TY3PR01MB111481E9B0AF263ACC8EA5D4AE5BA2@TY3PR01MB11148.jpnprd01.prod.outlook.com/
[2] https://lore.kernel.org/lkml/104d0ec31cb45477e27273e089402d4205ee4042.camel@amazon.com/
Ankur Arora (6):
cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
arm64: idle: export arch_cpu_idle
arm64: select ARCH_HAS_OPTIMIZED_POLL
cpuidle/poll_state: limit POLL_IDLE_RELAX_COUNT on arm64
arm64: support cpuidle-haltpoll
Joao Martins (4):
Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
cpuidle-haltpoll: define arch_haltpoll_want()
governors/haltpoll: drop kvm_para_available() check
arm64: define TIF_POLLING_NRFLAG
Mihai Carabas (1):
cpuidle/poll_state: poll via smp_cond_load_relaxed()
arch/Kconfig | 3 +++
arch/arm64/Kconfig | 7 +++++++
arch/arm64/include/asm/cpuidle_haltpoll.h | 24 +++++++++++++++++++++++
arch/arm64/include/asm/thread_info.h | 2 ++
arch/arm64/kernel/idle.c | 1 +
arch/x86/Kconfig | 5 ++---
arch/x86/include/asm/cpuidle_haltpoll.h | 1 +
arch/x86/kernel/kvm.c | 13 ++++++++++++
drivers/acpi/processor_idle.c | 4 ++--
drivers/cpuidle/Kconfig | 5 ++---
drivers/cpuidle/Makefile | 2 +-
drivers/cpuidle/cpuidle-haltpoll.c | 12 +-----------
drivers/cpuidle/governors/haltpoll.c | 6 +-----
drivers/cpuidle/poll_state.c | 22 +++++++++++++++------
drivers/idle/Kconfig | 1 +
include/linux/cpuidle.h | 2 +-
include/linux/cpuidle_haltpoll.h | 5 +++++
17 files changed, 83 insertions(+), 32 deletions(-)
create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h
--
2.43.5
On Thu, 26 Sep 2024 00:24:14 +0100,
Ankur Arora <ankur.a.arora@oracle.com> wrote:
> [...]
>
> Both sets of tests were on otherwise idle systems with guest VCPUs
> pinned to specific PCPUs. One reason for the higher stdev on arm64
> is that trapping of the WFE instruction by the host KVM is contingent
> on the number of tasks on the runqueue.

Sorry to state the obvious, but if that's the variable trapping of
WFI/WFE is the cause of your trouble, why don't you simply turn it off
(see 0b5afe05377d for the details)? Given that you pin your vcpus to
physical CPUs, there is no need for any trapping.

	M.

--
Without deviation from the norm, progress is not possible.
Marc Zyngier <maz@kernel.org> writes:

> On Thu, 26 Sep 2024 00:24:14 +0100,
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>> [...]
>>
>> Both sets of tests were on otherwise idle systems with guest VCPUs
>> pinned to specific PCPUs. One reason for the higher stdev on arm64
>> is that trapping of the WFE instruction by the host KVM is contingent
>> on the number of tasks on the runqueue.
>
> Sorry to state the obvious, but if that's the variable trapping of
> WFI/WFE is the cause of your trouble, why don't you simply turn it off
> (see 0b5afe05377d for the details)? Given that you pin your vcpus to
> physical CPUs, there is no need for any trapping.

Good point. Thanks. That should help reduce the guessing games around
the variance in these tests.

--
ankur
On Wed, 16 Oct 2024 22:55:09 +0100,
Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> Marc Zyngier <maz@kernel.org> writes:
>
> > [...]
> >
> > Sorry to state the obvious, but if that's the variable trapping of
> > WFI/WFE is the cause of your trouble, why don't you simply turn it off
> > (see 0b5afe05377d for the details)? Given that you pin your vcpus to
> > physical CPUs, there is no need for any trapping.
>
> Good point. Thanks. That should help reduce the guessing games around
> the variance in these tests.

I'd be interested to find out whether there is still some benefit in
this series once you disable the WFx trapping heuristics.

Thanks,

	M.

--
Without deviation from the norm, progress is not possible.
Marc Zyngier <maz@kernel.org> writes:

> On Wed, 16 Oct 2024 22:55:09 +0100,
> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> [...]
>>
>> Good point. Thanks. That should help reduce the guessing games around
>> the variance in these tests.
>
> I'd be interested to find out whether there is still some benefit in
> this series once you disable the WFx trapping heuristics.

The benefit of polling in idle is more than just avoiding the cost of
trapping and re-entering. The other benefit is that remote wakeups
can now be done just by setting need-resched, instead of sending an
IPI, and incurring the cost of handling the interrupt on the receiver
side.

But let me get you some numbers with that.

--
ankur
Ankur Arora <ankur.a.arora@oracle.com> writes:
> Marc Zyngier <maz@kernel.org> writes:
>
>> On Wed, 16 Oct 2024 22:55:09 +0100,
>> Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>>
>>>
>>> Marc Zyngier <maz@kernel.org> writes:
>>>
>>> > On Thu, 26 Sep 2024 00:24:14 +0100,
>>> > Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>> >>
>>> >> This patchset enables the cpuidle-haltpoll driver and its namesake
>>> >> governor on arm64. This is specifically interesting for KVM guests by
>>> >> reducing IPC latencies.
>>> >>
>>> >> Comparing idle switching latencies on an arm64 KVM guest with
>>> >> perf bench sched pipe:
>>> >>
>>> >> usecs/op %stdev
>>> >>
>>> >> no haltpoll (baseline) 13.48 +- 5.19%
>>> >> with haltpoll 6.84 +- 22.07%
>>> >>
>>> >>
>>> >> No change in performance for a similar test on x86:
>>> >>
>>> >> usecs/op %stdev
>>> >>
>>> >> haltpoll w/ cpu_relax() (baseline) 4.75 +- 1.76%
>>> >> haltpoll w/ smp_cond_load_relaxed() 4.78 +- 2.31%
>>> >>
>>> >> Both sets of tests were on otherwise idle systems with guest VCPUs
>>> >> pinned to specific PCPUs. One reason for the higher stdev on arm64
>>> >> is that trapping of the WFE instruction by the host KVM is contingent
>>> >> on the number of tasks on the runqueue.
>>> >
>>> > Sorry to state the obvious, but if that's the variable trapping of
>>> > WFI/WFE is the cause of your trouble, why don't you simply turn it off
>>> > (see 0b5afe05377d for the details)? Given that you pin your vcpus to
>>> > physical CPUs, there is no need for any trapping.
>>>
>>> Good point. Thanks. That should help reduce the guessing games around
>>> the variance in these tests.
>>
>> I'd be interested to find out whether there is still some benefit in
>> this series once you disable the WFx trapping heuristics.
>
> The benefit of polling in idle is more than just avoiding the cost of
> trapping and re-entering. The other benefit is that remote wakeups
> can now be done just by setting need-resched, instead of sending an
> IPI, and incurring the cost of handling the interrupt on the receiver
> side.
>
> But let me get you some numbers with that.
So, I ran the sched-pipe test with processes on VCPUs 4 and 5 with
kvm-arm.wfi_trap_policy=notrap.
# perf stat -r 5 --cpu 4,5 -e task-clock,cycles,instructions,sched:sched_wake_idle_without_ipi \
perf bench sched pipe -l 1000000 -c 4
# No haltpoll (and, no TIF_POLLING_NRFLAG):
Performance counter stats for 'CPU(s) 4,5' (5 runs):
25,229.57 msec task-clock # 2.000 CPUs utilized ( +- 7.75% )
45,821,250,284 cycles # 1.816 GHz ( +- 10.07% )
26,557,496,665 instructions # 0.58 insn per cycle ( +- 0.21% )
0 sched:sched_wake_idle_without_ipi # 0.000 /sec
12.615 +- 0.977 seconds time elapsed ( +- 7.75% )
# Haltpoll:
Performance counter stats for 'CPU(s) 4,5' (5 runs):
15,131.58 msec task-clock # 2.000 CPUs utilized ( +- 10.00% )
34,158,188,839 cycles # 2.257 GHz ( +- 6.91% )
20,824,950,916 instructions # 0.61 insn per cycle ( +- 0.09% )
1,983,822 sched:sched_wake_idle_without_ipi # 131.105 K/sec ( +- 0.78% )
7.566 +- 0.756 seconds time elapsed ( +- 10.00% )
We get a decent boost just from executing ~20% fewer instructions. I'm
not sure how CPU frequency scaling works in a VM, but we also run at a
higher frequency.
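The sched:sched_wake_idle_without_ipi counts above correspond to wakeups
where the waker found the target CPU advertising TIF_POLLING_NRFLAG and
so only had to set TIF_NEED_RESCHED. A simplified sketch of that
decision, modelled on the resched path in kernel/sched/core.c (the
wrapper name below is made up for illustration; the helpers are the
kernel's):

```c
/*
 * Simplified sketch (modelled on resched_curr() in kernel/sched/core.c;
 * not verbatim). If the target CPU is polling TIF_NEED_RESCHED, setting
 * the flag is enough and no rescheduling IPI is sent.
 */
static void wake_remote_cpu(struct task_struct *curr, int cpu)
{
        /*
         * set_nr_and_not_polling() sets TIF_NEED_RESCHED and returns
         * true only if the target was *not* polling, i.e. an IPI is
         * still required to get its attention.
         */
        if (set_nr_and_not_polling(curr))
                smp_send_reschedule(cpu);
        else
                trace_sched_wake_idle_without_ipi(cpu);
}
```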
--
ankur
Various versions of this patchset have been extensively tested at Ampere with numerous benchmarks by our performance testing groups. Please merge this. We will continue to contribute to this patchset.
On 2024/9/26 7:24, Ankur Arora wrote:
> This patchset enables the cpuidle-haltpoll driver and its namesake
> governor on arm64. This is specifically interesting for KVM guests by
> reducing IPC latencies.
>
> Comparing idle switching latencies on an arm64 KVM guest with
> perf bench sched pipe:
>
> usecs/op %stdev
>
> no haltpoll (baseline) 13.48 +- 5.19%
> with haltpoll 6.84 +- 22.07%
>
>
> No change in performance for a similar test on x86:
>
> usecs/op %stdev
>
> haltpoll w/ cpu_relax() (baseline) 4.75 +- 1.76%
> haltpoll w/ smp_cond_load_relaxed() 4.78 +- 2.31%
>
> Both sets of tests were on otherwise idle systems with guest VCPUs
> pinned to specific PCPUs. One reason for the higher stdev on arm64
> is that trapping of the WFE instruction by the host KVM is contingent
> on the number of tasks on the runqueue.
>
> Tomohiro Misono and Haris Okanovic also report similar latency
> improvements on Grace and Graviton systems (for v7) [1] [2].
>
> The patch series is organized in three parts:
>
> - patch 1, reorganizes the poll_idle() loop, switching to
> smp_cond_load_relaxed() in the polling loop.
> Relatedly patches 2, 3 mangle the config option ARCH_HAS_CPU_RELAX,
> renaming it to ARCH_HAS_OPTIMIZED_POLL.
>
> - patches 4-6 reorganize the haltpoll selection and init logic
> to allow architecture code to select it.
>
> - and finally, patches 7-11 add the bits for arm64 support.
>
> What is still missing: this series largely completes the haltpoll side
> of functionality for arm64. There are, however, a few related areas
> that still need to be threshed out:
>
> - WFET support: WFE on arm64 does not guarantee that poll_idle()
> would terminate in halt_poll_ns. Using WFET would address this.
> - KVM_NO_POLL support on arm64
> - KVM TWED support on arm64: allow the host to limit time spent in
> WFE.
>
>
> Changelog:
>
> v8: No logic changes. Largely respin of v7, with changes
> noted below:
>
> - move selection of ARCH_HAS_OPTIMIZED_POLL on arm64 to its
> own patch.
> (patch-9 "arm64: select ARCH_HAS_OPTIMIZED_POLL")
>
> - address comments simplifying arm64 support (Will Deacon)
> (patch-11 "arm64: support cpuidle-haltpoll")
>
> v7: No significant logic changes. Mostly a respin of v6.
>
> - minor cleanup in poll_idle() (Christoph Lameter)
> - fixes conflicts due to code movement in arch/arm64/kernel/cpuidle.c
> (Tomohiro Misono)
>
> v6:
>
> - reordered the patches to keep poll_idle() and ARCH_HAS_OPTIMIZED_POLL
> changes together (comment from Christoph Lameter)
> - threshes out the commit messages a bit more (comments from Christoph
> Lameter, Sudeep Holla)
> - also rework selection of cpuidle-haltpoll. Now selected based
> on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
> - moved back to arch_haltpoll_want() (comment from Joao Martins)
> Also, arch_haltpoll_want() now takes the force parameter and is
> now responsible for the complete selection (or not) of haltpoll.
> - fixes the build breakage on i386
> - fixes the cpuidle-haltpoll module breakage on arm64 (comment from
> Tomohiro Misono, Haris Okanovic)
>
>
> v5:
> - rework the poll_idle() loop around smp_cond_load_relaxed() (review
> comment from Tomohiro Misono.)
> - also rework selection of cpuidle-haltpoll. Now selected based
> on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
> - arch_haltpoll_supported() (renamed from arch_haltpoll_want()) on
> arm64 now depends on the event-stream being enabled.
> - limit POLL_IDLE_RELAX_COUNT on arm64 (review comment from Haris Okanovic)
> - ARCH_HAS_CPU_RELAX is now renamed to ARCH_HAS_OPTIMIZED_POLL.
>
> v4 changes from v3:
> - change 7/8 per Rafael input: drop the parens and use ret for the final check
> - add 8/8 which renames the guard for building poll_state
>
> v3 changes from v2:
> - fix 1/7 per Petr Mladek - remove ARCH_HAS_CPU_RELAX from arch/x86/Kconfig
> - add Ack-by from Rafael Wysocki on 2/7
>
> v2 changes from v1:
> - added patch 7 where we change cpu_relax with smp_cond_load_relaxed per PeterZ
> (this improves by 50% at least the CPU cycles consumed in the tests above:
> 10,716,881,137 now vs 14,503,014,257 before)
> - removed the ifdef from patch 1 per RafaelW
>
> Please review.
>
> [1] https://lore.kernel.org/lkml/TY3PR01MB111481E9B0AF263ACC8EA5D4AE5BA2@TY3PR01MB11148.jpnprd01.prod.outlook.com/
> [2] https://lore.kernel.org/lkml/104d0ec31cb45477e27273e089402d4205ee4042.camel@amazon.com/
>
> Ankur Arora (6):
> cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
> cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
> arm64: idle: export arch_cpu_idle
> arm64: select ARCH_HAS_OPTIMIZED_POLL
> cpuidle/poll_state: limit POLL_IDLE_RELAX_COUNT on arm64
> arm64: support cpuidle-haltpoll
>
> Joao Martins (4):
> Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
> cpuidle-haltpoll: define arch_haltpoll_want()
> governors/haltpoll: drop kvm_para_available() check
> arm64: define TIF_POLLING_NRFLAG
>
> Mihai Carabas (1):
> cpuidle/poll_state: poll via smp_cond_load_relaxed()
>
> arch/Kconfig | 3 +++
> arch/arm64/Kconfig | 7 +++++++
> arch/arm64/include/asm/cpuidle_haltpoll.h | 24 +++++++++++++++++++++++
> arch/arm64/include/asm/thread_info.h | 2 ++
> arch/arm64/kernel/idle.c | 1 +
> arch/x86/Kconfig | 5 ++---
> arch/x86/include/asm/cpuidle_haltpoll.h | 1 +
> arch/x86/kernel/kvm.c | 13 ++++++++++++
> drivers/acpi/processor_idle.c | 4 ++--
> drivers/cpuidle/Kconfig | 5 ++---
> drivers/cpuidle/Makefile | 2 +-
> drivers/cpuidle/cpuidle-haltpoll.c | 12 +-----------
> drivers/cpuidle/governors/haltpoll.c | 6 +-----
> drivers/cpuidle/poll_state.c | 22 +++++++++++++++------
> drivers/idle/Kconfig | 1 +
> include/linux/cpuidle.h | 2 +-
> include/linux/cpuidle_haltpoll.h | 5 +++++
> 17 files changed, 83 insertions(+), 32 deletions(-)
> create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h
>
Hi Ankur,
Thanks for the patches!
We have tested these patches on our machine, with an adaptation for ACPI LPI
states rather than C-states.
Including the polling state, there are three states to enter. Comparing the
idle switching latencies of the different states with perf bench sched pipe:
usecs/op %stdev
state0(polling state) 7.36 +- 0.35%
state1 8.78 +- 0.46%
state2 77.32 +- 5.50%
It turns out that it works on our machine.
Tested-by: Lifeng Zheng <zhenglifeng1@huawei.com>
The adaptation for ACPI LPI states is shown below as a patch. Feel free to
include this patch as part of your series, or I can also send it out after
your series is merged.
From: Lifeng Zheng <zhenglifeng1@huawei.com>
ACPI: processor_idle: Support polling state for LPI
Initialize an optional polling state besides LPI states.
Wrap up a new enter method to correctly reflect the actual entered state
when the polling state is enabled.
Signed-off-by: Lifeng Zheng <zhenglifeng1@huawei.com>
Reviewed-by: Jie Zhan <zhanjie9@hisilicon.com>
---
drivers/acpi/processor_idle.c | 39 ++++++++++++++++++++++++++++++-----
1 file changed, 34 insertions(+), 5 deletions(-)
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 44096406d65d..d154b5d77328 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -1194,20 +1194,46 @@ static int acpi_idle_lpi_enter(struct cpuidle_device *dev,
return -EINVAL;
}
+/* To correctly reflect the entered state if the poll state is enabled. */
+static int acpi_idle_lpi_enter_with_poll_state(struct cpuidle_device *dev,
+ struct cpuidle_driver *drv, int index)
+{
+ int entered_state;
+
+ if (unlikely(index < 1))
+ return -EINVAL;
+
+ entered_state = acpi_idle_lpi_enter(dev, drv, index - 1);
+ if (entered_state < 0)
+ return entered_state;
+
+ return entered_state + 1;
+}
+
static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
{
- int i;
+ int i, count;
struct acpi_lpi_state *lpi;
struct cpuidle_state *state;
struct cpuidle_driver *drv = &acpi_idle_driver;
+ typeof(state->enter) enter_method;
if (!pr->flags.has_lpi)
return -EOPNOTSUPP;
+ if (IS_ENABLED(CONFIG_ARCH_HAS_OPTIMIZED_POLL)) {
+ cpuidle_poll_state_init(drv);
+ count = 1;
+ enter_method = acpi_idle_lpi_enter_with_poll_state;
+ } else {
+ count = 0;
+ enter_method = acpi_idle_lpi_enter;
+ }
+
for (i = 0; i < pr->power.count && i < CPUIDLE_STATE_MAX; i++) {
lpi = &pr->power.lpi_states[i];
- state = &drv->states[i];
+ state = &drv->states[count];
snprintf(state->name, CPUIDLE_NAME_LEN, "LPI-%d", i);
strscpy(state->desc, lpi->desc, CPUIDLE_DESC_LEN);
state->exit_latency = lpi->wake_latency;
@@ -1215,11 +1241,14 @@ static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
state->flags |= arch_get_idle_state_flags(lpi->arch_flags);
if (i != 0 && lpi->entry_method == ACPI_CSTATE_FFH)
state->flags |= CPUIDLE_FLAG_RCU_IDLE;
- state->enter = acpi_idle_lpi_enter;
- drv->safe_state_index = i;
+ state->enter = enter_method;
+ drv->safe_state_index = count;
+ count++;
+ if (count == CPUIDLE_STATE_MAX)
+ break;
}
- drv->state_count = i;
+ drv->state_count = count;
return 0;
}
--
2.33.0
zhenglifeng (A) <zhenglifeng1@huawei.com> writes:
> On 2024/9/26 7:24, Ankur Arora wrote:
>> This patchset enables the cpuidle-haltpoll driver and its namesake
>> governor on arm64. This is specifically interesting for KVM guests by
>> reducing IPC latencies.
>>
>> Comparing idle switching latencies on an arm64 KVM guest with
>> perf bench sched pipe:
>>
>> usecs/op %stdev
>>
>> no haltpoll (baseline) 13.48 +- 5.19%
>> with haltpoll 6.84 +- 22.07%
>>
>>
>> No change in performance for a similar test on x86:
>>
>> usecs/op %stdev
>>
>> haltpoll w/ cpu_relax() (baseline) 4.75 +- 1.76%
>> haltpoll w/ smp_cond_load_relaxed() 4.78 +- 2.31%
>>
>> Both sets of tests were on otherwise idle systems with guest VCPUs
>> pinned to specific PCPUs. One reason for the higher stdev on arm64
>> is that trapping of the WFE instruction by the host KVM is contingent
>> on the number of tasks on the runqueue.
>>
>> Tomohiro Misono and Haris Okanovic also report similar latency
>> improvements on Grace and Graviton systems (for v7) [1] [2].
>>
>> The patch series is organized in three parts:
>>
>> - patch 1, reorganizes the poll_idle() loop, switching to
>> smp_cond_load_relaxed() in the polling loop.
>> Relatedly patches 2, 3 mangle the config option ARCH_HAS_CPU_RELAX,
>> renaming it to ARCH_HAS_OPTIMIZED_POLL.
>>
>> - patches 4-6 reorganize the haltpoll selection and init logic
>> to allow architecture code to select it.
>>
>> - and finally, patches 7-11 add the bits for arm64 support.
>>
>> What is still missing: this series largely completes the haltpoll side
>> of functionality for arm64. There are, however, a few related areas
>> that still need to be threshed out:
>>
>> - WFET support: WFE on arm64 does not guarantee that poll_idle()
>> would terminate in halt_poll_ns. Using WFET would address this.
>> - KVM_NO_POLL support on arm64
>> - KVM TWED support on arm64: allow the host to limit time spent in
>> WFE.
>>
>>
>> Changelog:
>>
>> v8: No logic changes. Largely respin of v7, with changes
>> noted below:
>>
>> - move selection of ARCH_HAS_OPTIMIZED_POLL on arm64 to its
>> own patch.
>> (patch-9 "arm64: select ARCH_HAS_OPTIMIZED_POLL")
>>
>> - address comments simplifying arm64 support (Will Deacon)
>> (patch-11 "arm64: support cpuidle-haltpoll")
>>
>> v7: No significant logic changes. Mostly a respin of v6.
>>
>> - minor cleanup in poll_idle() (Christoph Lameter)
>> - fixes conflicts due to code movement in arch/arm64/kernel/cpuidle.c
>> (Tomohiro Misono)
>>
>> v6:
>>
>> - reordered the patches to keep poll_idle() and ARCH_HAS_OPTIMIZED_POLL
>> changes together (comment from Christoph Lameter)
>> - threshes out the commit messages a bit more (comments from Christoph
>> Lameter, Sudeep Holla)
>> - also rework selection of cpuidle-haltpoll. Now selected based
>> on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
>> - moved back to arch_haltpoll_want() (comment from Joao Martins)
>> Also, arch_haltpoll_want() now takes the force parameter and is
>> now responsible for the complete selection (or not) of haltpoll.
>> - fixes the build breakage on i386
>> - fixes the cpuidle-haltpoll module breakage on arm64 (comment from
>> Tomohiro Misono, Haris Okanovic)
>>
>>
>> v5:
>> - rework the poll_idle() loop around smp_cond_load_relaxed() (review
>> comment from Tomohiro Misono.)
>> - also rework selection of cpuidle-haltpoll. Now selected based
>> on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
>> - arch_haltpoll_supported() (renamed from arch_haltpoll_want()) on
>> arm64 now depends on the event-stream being enabled.
>> - limit POLL_IDLE_RELAX_COUNT on arm64 (review comment from Haris Okanovic)
>> - ARCH_HAS_CPU_RELAX is now renamed to ARCH_HAS_OPTIMIZED_POLL.
>>
>> v4 changes from v3:
>> - change 7/8 per Rafael input: drop the parens and use ret for the final check
>> - add 8/8 which renames the guard for building poll_state
>>
>> v3 changes from v2:
>> - fix 1/7 per Petr Mladek - remove ARCH_HAS_CPU_RELAX from arch/x86/Kconfig
>> - add Ack-by from Rafael Wysocki on 2/7
>>
>> v2 changes from v1:
>> - added patch 7 where we change cpu_relax with smp_cond_load_relaxed per PeterZ
>> (this improves by 50% at least the CPU cycles consumed in the tests above:
>> 10,716,881,137 now vs 14,503,014,257 before)
>> - removed the ifdef from patch 1 per RafaelW
>>
>> Please review.
>>
>> [1] https://lore.kernel.org/lkml/TY3PR01MB111481E9B0AF263ACC8EA5D4AE5BA2@TY3PR01MB11148.jpnprd01.prod.outlook.com/
>> [2] https://lore.kernel.org/lkml/104d0ec31cb45477e27273e089402d4205ee4042.camel@amazon.com/
>>
>> Ankur Arora (6):
>> cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
>> cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
>> arm64: idle: export arch_cpu_idle
>> arm64: select ARCH_HAS_OPTIMIZED_POLL
>> cpuidle/poll_state: limit POLL_IDLE_RELAX_COUNT on arm64
>> arm64: support cpuidle-haltpoll
>>
>> Joao Martins (4):
>> Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
>> cpuidle-haltpoll: define arch_haltpoll_want()
>> governors/haltpoll: drop kvm_para_available() check
>> arm64: define TIF_POLLING_NRFLAG
>>
>> Mihai Carabas (1):
>> cpuidle/poll_state: poll via smp_cond_load_relaxed()
>>
>> arch/Kconfig | 3 +++
>> arch/arm64/Kconfig | 7 +++++++
>> arch/arm64/include/asm/cpuidle_haltpoll.h | 24 +++++++++++++++++++++++
>> arch/arm64/include/asm/thread_info.h | 2 ++
>> arch/arm64/kernel/idle.c | 1 +
>> arch/x86/Kconfig | 5 ++---
>> arch/x86/include/asm/cpuidle_haltpoll.h | 1 +
>> arch/x86/kernel/kvm.c | 13 ++++++++++++
>> drivers/acpi/processor_idle.c | 4 ++--
>> drivers/cpuidle/Kconfig | 5 ++---
>> drivers/cpuidle/Makefile | 2 +-
>> drivers/cpuidle/cpuidle-haltpoll.c | 12 +-----------
>> drivers/cpuidle/governors/haltpoll.c | 6 +-----
>> drivers/cpuidle/poll_state.c | 22 +++++++++++++++------
>> drivers/idle/Kconfig | 1 +
>> include/linux/cpuidle.h | 2 +-
>> include/linux/cpuidle_haltpoll.h | 5 +++++
>> 17 files changed, 83 insertions(+), 32 deletions(-)
>> create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h
>>
>
> Hi Ankur,
>
> Thanks for the patches!
>
> We have tested these patches on our machine, with an adaptation for ACPI LPI
> states rather than C-states.
>
> Including the polling state, there are three states to enter. Comparing the
> idle switching latencies of the different states with perf bench sched pipe:
>
> usecs/op %stdev
>
> state0(polling state) 7.36 +- 0.35%
> state1 8.78 +- 0.46%
> state2 77.32 +- 5.50%
>
> It turns out that it works on our machine.
>
> Tested-by: Lifeng Zheng <zhenglifeng1@huawei.com>
Great. Thanks Lifeng.
> The adaptation for ACPI LPI states is shown below as a patch. Feel free to
> include this patch as part of your series, or I can also send it out after
> your series is merged.
Ah, so polling for the regular ACPI driver. From a quick look the
patch looks good but this series is mostly focused on haltpoll so I
think this patch can go in after.
Please Cc me when you send it.
Thanks
Ankur
> From: Lifeng Zheng <zhenglifeng1@huawei.com>
>
> ACPI: processor_idle: Support polling state for LPI
>
> Initialize an optional polling state besides LPI states.
>
> Wrap up a new enter method to correctly reflect the actual entered state
> when the polling state is enabled.
>
> Signed-off-by: Lifeng Zheng <zhenglifeng1@huawei.com>
> Reviewed-by: Jie Zhan <zhanjie9@hisilicon.com>
> ---
> drivers/acpi/processor_idle.c | 39 ++++++++++++++++++++++++++++++-----
> 1 file changed, 34 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index 44096406d65d..d154b5d77328 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -1194,20 +1194,46 @@ static int acpi_idle_lpi_enter(struct cpuidle_device *dev,
> return -EINVAL;
> }
>
> +/* To correctly reflect the entered state if the poll state is enabled. */
> +static int acpi_idle_lpi_enter_with_poll_state(struct cpuidle_device *dev,
> + struct cpuidle_driver *drv, int index)
> +{
> + int entered_state;
> +
> + if (unlikely(index < 1))
> + return -EINVAL;
> +
> + entered_state = acpi_idle_lpi_enter(dev, drv, index - 1);
> + if (entered_state < 0)
> + return entered_state;
> +
> + return entered_state + 1;
> +}
> +
> static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
> {
> - int i;
> + int i, count;
> struct acpi_lpi_state *lpi;
> struct cpuidle_state *state;
> struct cpuidle_driver *drv = &acpi_idle_driver;
> + typeof(state->enter) enter_method;
>
> if (!pr->flags.has_lpi)
> return -EOPNOTSUPP;
>
> + if (IS_ENABLED(CONFIG_ARCH_HAS_OPTIMIZED_POLL)) {
> + cpuidle_poll_state_init(drv);
> + count = 1;
> + enter_method = acpi_idle_lpi_enter_with_poll_state;
> + } else {
> + count = 0;
> + enter_method = acpi_idle_lpi_enter;
> + }
> +
> for (i = 0; i < pr->power.count && i < CPUIDLE_STATE_MAX; i++) {
> lpi = &pr->power.lpi_states[i];
>
> - state = &drv->states[i];
> + state = &drv->states[count];
> snprintf(state->name, CPUIDLE_NAME_LEN, "LPI-%d", i);
> strscpy(state->desc, lpi->desc, CPUIDLE_DESC_LEN);
> state->exit_latency = lpi->wake_latency;
> @@ -1215,11 +1241,14 @@ static int acpi_processor_setup_lpi_states(struct acpi_processor *pr)
> state->flags |= arch_get_idle_state_flags(lpi->arch_flags);
> if (i != 0 && lpi->entry_method == ACPI_CSTATE_FFH)
> state->flags |= CPUIDLE_FLAG_RCU_IDLE;
> - state->enter = acpi_idle_lpi_enter;
> - drv->safe_state_index = i;
> + state->enter = enter_method;
> + drv->safe_state_index = count;
> + count++;
> + if (count == CPUIDLE_STATE_MAX)
> + break;
> }
>
> - drv->state_count = i;
> + drv->state_count = count;
>
> return 0;
> }
--
ankur
Hi Ankur, Catalin,

How about the following series based on a refactor of arm64's delay()?
Does it address your earlier concerns?

delay() already implements wfet() and falls back to wfe() w/ evstream
or a cpu_relax loop. I refactored it to poll an address, and wrapped in
a new platform-agnostic smp_vcond_load_relaxed() macro. More details in
the following commit log.

Regards,
Haris Okanovic
AWS Graviton Software
Haris Okanovic <harisokn@amazon.com> writes:

> Hi Ankur, Catalin,
>
> How about the following series based on a refactor of arm64's delay()?
> Does it address your earlier concerns?
>
> delay() already implements wfet() and falls back to wfe() w/ evstream
> or a cpu_relax loop. I refactored it to poll an address, and wrapped in
> a new platform-agnostic smp_vcond_load_relaxed() macro. More details in
> the following commit log.

Haven't looked at your series too closely but it looks quite a bit
different from the version I was working on.

Let me send out my version as well in the next few days.

--
ankur
Relaxed poll until desired mask/value is observed at the specified
address or timeout.
This macro is a specialization of the generic smp_cond_load_relaxed(),
which takes a simple mask/value condition (vcond) instead of an
arbitrary expression. It allows architectures to better specialize the
implementation, e.g. to enable wfe() polling of the address on arm.
Signed-off-by: Haris Okanovic <harisokn@amazon.com>
---
include/asm-generic/barrier.h | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index d4f581c1e21d..112027eabbfc 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -256,6 +256,31 @@ do { \
})
#endif
+/**
+ * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
+ * with no ordering guarantees. Spins until `(*addr & mask) == val` or
+ * `nsecs` elapse, and returns the last observed `*addr` value.
+ *
+ * @nsecs: timeout in nanoseconds
+ * @addr: pointer to an integer
+ * @mask: a bit mask applied to read values
+ * @val: Expected value with mask
+ */
+#ifndef smp_vcond_load_relaxed
+#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({ \
+ const u64 __start = local_clock_noinstr(); \
+ u64 __nsecs = (nsecs); \
+ typeof(addr) __addr = (addr); \
+ typeof(*__addr) __mask = (mask); \
+ typeof(*__addr) __val = (val); \
+ typeof(*__addr) __cur; \
+ smp_cond_load_relaxed(__addr, ( \
+ (VAL & __mask) == __val || \
+ local_clock_noinstr() - __start > __nsecs \
+ )); \
+})
+#endif
+
/**
* smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
* @ptr: pointer to the variable to wait on
--
2.34.1
On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> Relaxed poll until desired mask/value is observed at the specified
> address or timeout.
>
> This macro is a specialization of the generic smp_cond_load_relaxed(),
> which takes a simple mask/value condition (vcond) instead of an
> arbitrary expression. It allows architectures to better specialize the
> implementation, e.g. to enable wfe() polling of the address on arm.
This doesn't make sense to me. The existing smp_cond_load() functions
already use wfe on arm64 and I don't see why we need a special helper
just to do a mask.
> Signed-off-by: Haris Okanovic <harisokn@amazon.com>
> ---
> include/asm-generic/barrier.h | 25 +++++++++++++++++++++++++
> 1 file changed, 25 insertions(+)
>
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index d4f581c1e21d..112027eabbfc 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -256,6 +256,31 @@ do { \
> })
> #endif
>
> +/**
> + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> + * `nsecs` elapse, and returns the last observed `*addr` value.
> + *
> + * @nsecs: timeout in nanoseconds
> + * @addr: pointer to an integer
> + * @mask: a bit mask applied to read values
> + * @val: Expected value with mask
> + */
> +#ifndef smp_vcond_load_relaxed
I know naming is hard, but "vcond" is especially terrible.
Perhaps smp_cond_load_timeout()?
Will
On Wed, 2024-11-06 at 11:39 +0000, Will Deacon wrote:
> On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> > Relaxed poll until desired mask/value is observed at the specified
> > address or timeout.
> >
> > This macro is a specialization of the generic smp_cond_load_relaxed(),
> > which takes a simple mask/value condition (vcond) instead of an
> > arbitrary expression. It allows architectures to better specialize the
> > implementation, e.g. to enable wfe() polling of the address on arm.
>
> This doesn't make sense to me. The existing smp_cond_load() functions
> already use wfe on arm64 and I don't see why we need a special helper
> just to do a mask.
We can't turn an arbitrary C expression into a wfe()/wfet() exit
condition, which is one of the inputs to the existing smp_cond_load().
This API is therefore more amenable to hardware acceleration.
>
> > Signed-off-by: Haris Okanovic <harisokn@amazon.com>
> > ---
> > include/asm-generic/barrier.h | 25 +++++++++++++++++++++++++
> > 1 file changed, 25 insertions(+)
> >
> > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> > index d4f581c1e21d..112027eabbfc 100644
> > --- a/include/asm-generic/barrier.h
> > +++ b/include/asm-generic/barrier.h
> > @@ -256,6 +256,31 @@ do { \
> > })
> > #endif
> >
> > +/**
> > + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> > + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> > + * `nsecs` elapse, and returns the last observed `*addr` value.
> > + *
> > + * @nsecs: timeout in nanoseconds
> > + * @addr: pointer to an integer
> > + * @mask: a bit mask applied to read values
> > + * @val: Expected value with mask
> > + */
> > +#ifndef smp_vcond_load_relaxed
>
> I know naming is hard, but "vcond" is especially terrible.
> Perhaps smp_cond_load_timeout()?
I agree, naming is hard! I was trying to differentiate it from
smp_cond_load() in some meaningful way - that one is an "expression"
condition, this one is a "value" condition.
I'll think it over a bit more.
>
> Will
On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index d4f581c1e21d..112027eabbfc 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -256,6 +256,31 @@ do { \
> })
> #endif
>
> +/**
> + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> + * `nsecs` elapse, and returns the last observed `*addr` value.
> + *
> + * @nsecs: timeout in nanoseconds
FWIW, I don't mind the relative timeout, it makes the API easier to use.
Yes, it may take longer in absolute time if the thread is scheduled out
before local_clock_noinstr() is read but the same can happen in the
caller anyway. It's similar to udelay(), it can take longer if the
thread is scheduled out.
> + * @addr: pointer to an integer
> + * @mask: a bit mask applied to read values
> + * @val: Expected value with mask
> + */
> +#ifndef smp_vcond_load_relaxed
> +#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({ \
> + const u64 __start = local_clock_noinstr(); \
> + u64 __nsecs = (nsecs); \
> + typeof(addr) __addr = (addr); \
> + typeof(*__addr) __mask = (mask); \
> + typeof(*__addr) __val = (val); \
> + typeof(*__addr) __cur; \
> + smp_cond_load_relaxed(__addr, ( \
> + (VAL & __mask) == __val || \
> + local_clock_noinstr() - __start > __nsecs \
> + )); \
> +})
The generic implementation has the same problem as Ankur's current
series. smp_cond_load_relaxed() can't wait on anything other than the
variable at __addr. If it goes into a WFE, there's nothing executed to
read the timer and check for progress. Any generic implementation of
such function would have to use cpu_relax() and polling.
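For reference, the arm64 definition has roughly this shape (paraphrased
from arch/arm64/include/asm/barrier.h; not verbatim):

```c
/* Paraphrase of the arm64 smp_cond_load_relaxed(); not verbatim. */
#define smp_cond_load_relaxed(ptr, cond_expr)				\
({									\
	typeof(ptr) __PTR = (ptr);					\
	__unqual_scalar_typeof(*ptr) VAL;				\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		__cmpwait_relaxed(__PTR, VAL);	/* LDXR + WFE */	\
	}								\
	(typeof(*ptr))VAL;						\
})
```

Once cond_expr evaluates false, __cmpwait_relaxed() arms the exclusive
monitor and executes WFE, so the wait only ends on a write to *ptr or
on an event (e.g. the event stream tick); the timeout half of the
condition is not re-evaluated in the meantime.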
--
Catalin
On Wed, 2024-11-06 at 11:08 +0000, Catalin Marinas wrote:
> On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> > index d4f581c1e21d..112027eabbfc 100644
> > --- a/include/asm-generic/barrier.h
> > +++ b/include/asm-generic/barrier.h
> > @@ -256,6 +256,31 @@ do { \
> > })
> > #endif
> >
> > +/**
> > + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> > + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> > + * `nsecs` elapse, and returns the last observed `*addr` value.
> > + *
> > + * @nsecs: timeout in nanoseconds
>
> FWIW, I don't mind the relative timeout, it makes the API easier to use.
> Yes, it may take longer in absolute time if the thread is scheduled out
> before local_clock_noinstr() is read but the same can happen in the
> caller anyway. It's similar to udelay(), it can take longer if the
> thread is scheduled out.
>
> > + * @addr: pointer to an integer
> > + * @mask: a bit mask applied to read values
> > + * @val: Expected value with mask
> > + */
> > +#ifndef smp_vcond_load_relaxed
> > +#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({ \
> > + const u64 __start = local_clock_noinstr(); \
> > + u64 __nsecs = (nsecs); \
> > + typeof(addr) __addr = (addr); \
> > + typeof(*__addr) __mask = (mask); \
> > + typeof(*__addr) __val = (val); \
> > + typeof(*__addr) __cur; \
> > + smp_cond_load_relaxed(__addr, ( \
> > + (VAL & __mask) == __val || \
> > + local_clock_noinstr() - __start > __nsecs \
> > + )); \
> > +})
>
> The generic implementation has the same problem as Ankur's current
> series. smp_cond_load_relaxed() can't wait on anything other than the
> variable at __addr. If it goes into a WFE, there's nothing executed to
> read the timer and check for progress. Any generic implementation of
> such function would have to use cpu_relax() and polling.
How would the caller enter wfe()? Can you give a specific scenario that
you're concerned about?
This code already reduces to a relaxed poll, something like this:
```
start = clock();
while((READ_ONCE(*addr) & mask) != val && (clock() - start) < nsecs) {
cpu_relax();
}
```
>
> --
> Catalin
On Wed, Nov 06, 2024 at 06:13:35PM +0000, Okanovic, Haris wrote:
> On Wed, 2024-11-06 at 11:08 +0000, Catalin Marinas wrote:
> > On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> > > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> > > index d4f581c1e21d..112027eabbfc 100644
> > > --- a/include/asm-generic/barrier.h
> > > +++ b/include/asm-generic/barrier.h
> > > @@ -256,6 +256,31 @@ do { \
> > > })
> > > #endif
> > >
> > > +/**
> > > + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> > > + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> > > + * `nsecs` elapse, and returns the last observed `*addr` value.
> > > + *
> > > + * @nsecs: timeout in nanoseconds
> >
> > FWIW, I don't mind the relative timeout, it makes the API easier to use.
> > Yes, it may take longer in absolute time if the thread is scheduled out
> > before local_clock_noinstr() is read but the same can happen in the
> > caller anyway. It's similar to udelay(), it can take longer if the
> > thread is scheduled out.
> >
> > > + * @addr: pointer to an integer
> > > + * @mask: a bit mask applied to read values
> > > + * @val: Expected value with mask
> > > + */
> > > +#ifndef smp_vcond_load_relaxed
> > > +#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({ \
> > > + const u64 __start = local_clock_noinstr(); \
> > > + u64 __nsecs = (nsecs); \
> > > + typeof(addr) __addr = (addr); \
> > > + typeof(*__addr) __mask = (mask); \
> > > + typeof(*__addr) __val = (val); \
> > > + typeof(*__addr) __cur; \
> > > + smp_cond_load_relaxed(__addr, ( \
> > > + (VAL & __mask) == __val || \
> > > + local_clock_noinstr() - __start > __nsecs \
> > > + )); \
> > > +})
> >
> > The generic implementation has the same problem as Ankur's current
> > series. smp_cond_load_relaxed() can't wait on anything other than the
> > variable at __addr. If it goes into a WFE, there's nothing executed to
> > read the timer and check for progress. Any generic implementation of
> > such function would have to use cpu_relax() and polling.
>
> How would the caller enter wfe()? Can you give a specific scenario that
> you're concerned about?
Let's take the arm64 example with the event stream disabled. Without the
subsequent patches implementing smp_vcond_load_relaxed(), just expand
the arm64 smp_cond_load_relaxed() implementation in the above macro. If
the timer check doesn't trigger an exit from the loop,
__cmpwait_relaxed() only waits on the variable to change its value,
nothing to do with the timer.
> This code already reduces to a relaxed poll, something like this:
>
> ```
> start = clock();
> while((READ_ONCE(*addr) & mask) != val && (clock() - start) < nsecs) {
> cpu_relax();
> }
> ```
Well, that's if you also use the generic implementation of
smp_cond_load_relaxed() but have you checked all the other architectures
that don't do something similar to the arm64 wfe (riscv comes close)?
Even if all other architectures just use a cpu_relax(), that's still
abusing the smp_cond_load_relaxed() semantics. And what if one places
another loop in their __cmpwait()? That's allowed because you are
supposed to wait on a single variable to change not on multiple states.
--
Catalin
On Wed, 2024-11-06 at 19:55 +0000, Catalin Marinas wrote:
> On Wed, Nov 06, 2024 at 06:13:35PM +0000, Okanovic, Haris wrote:
> > On Wed, 2024-11-06 at 11:08 +0000, Catalin Marinas wrote:
> > > On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> > > > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> > > > index d4f581c1e21d..112027eabbfc 100644
> > > > --- a/include/asm-generic/barrier.h
> > > > +++ b/include/asm-generic/barrier.h
> > > > @@ -256,6 +256,31 @@ do { \
> > > > })
> > > > #endif
> > > >
> > > > +/**
> > > > + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> > > > + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> > > > + * `nsecs` elapse, and returns the last observed `*addr` value.
> > > > + *
> > > > + * @nsecs: timeout in nanoseconds
> > >
> > > FWIW, I don't mind the relative timeout, it makes the API easier to use.
> > > Yes, it may take longer in absolute time if the thread is scheduled out
> > > before local_clock_noinstr() is read but the same can happen in the
> > > caller anyway. It's similar to udelay(), it can take longer if the
> > > thread is scheduled out.
> > >
> > > > + * @addr: pointer to an integer
> > > > + * @mask: a bit mask applied to read values
> > > > + * @val: Expected value with mask
> > > > + */
> > > > +#ifndef smp_vcond_load_relaxed
> > > > +#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({ \
> > > > + const u64 __start = local_clock_noinstr(); \
> > > > + u64 __nsecs = (nsecs); \
> > > > + typeof(addr) __addr = (addr); \
> > > > + typeof(*__addr) __mask = (mask); \
> > > > + typeof(*__addr) __val = (val); \
> > > > + typeof(*__addr) __cur; \
> > > > + smp_cond_load_relaxed(__addr, ( \
> > > > + (VAL & __mask) == __val || \
> > > > + local_clock_noinstr() - __start > __nsecs \
> > > > + )); \
> > > > +})
> > >
> > > The generic implementation has the same problem as Ankur's current
> > > series. smp_cond_load_relaxed() can't wait on anything other than the
> > > variable at __addr. If it goes into a WFE, there's nothing executed to
> > > read the timer and check for progress. Any generic implementation of
> > > such function would have to use cpu_relax() and polling.
> >
> > How would the caller enter wfe()? Can you give a specific scenario that
> > you're concerned about?
>
> Let's take the arm64 example with the event stream disabled. Without the
> subsequent patches implementing smp_vcond_load_relaxed(), just expand
> the arm64 smp_cond_load_relaxed() implementation in the above macro. If
> the timer check doesn't trigger an exit from the loop,
> __cmpwait_relaxed() only waits on the variable to change its value,
> nothing to do with the timer.
>
> > This code already reduces to a relaxed poll, something like this:
> >
> > ```
> > start = clock();
> > while((READ_ONCE(*addr) & mask) != val && (clock() - start) < nsecs) {
> > cpu_relax();
> > }
> > ```
>
> Well, that's if you also use the generic implementation of
> smp_cond_load_relaxed() but have you checked all the other architectures
> that don't do something similar to the arm64 wfe (riscv comes close)?
> Even if all other architectures just use a cpu_relax(), that's still
> abusing the smp_cond_load_relaxed() semantics. And what if one places
> another loop in their __cmpwait()? That's allowed because you are
> supposed to wait on a single variable to change not on multiple states.
I see what you mean now - I glossed over the use of __cmpwait_relaxed()
in smp_cond_load_relaxed(). I'll post another rev with the fix, similar
to the above "reduced" code.
>
> --
> Catalin
On Tue, 5 Nov 2024, Haris Okanovic wrote:

> +/**
> + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> + * `nsecs` elapse, and returns the last observed `*addr` value.
> + *
> + * @nsecs: timeout in nanoseconds

Please use an absolute time in nsecs instead of a timeout. You do not know
what will happen to your execution thread until the local_clock_noinstr()
is run.
On Tue, 2024-11-05 at 11:36 -0800, Christoph Lameter (Ampere) wrote:
> On Tue, 5 Nov 2024, Haris Okanovic wrote:
>
> > +/**
> > + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> > + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> > + * `nsecs` elapse, and returns the last observed `*addr` value.
> > + *
> > + * @nsecs: timeout in nanoseconds
>
> Please use an absolute time in nsecs instead of a timeout.

I went with relative time because it's clock agnostic. I agree a deadline
is nicer because it can propagate down layers of functions, but it pins
the caller to a single time base.

> You do not know
> what will happen to your execution thread until the local_clock_noinstr()
> is run.

Not sure what you mean. Could you perhaps give an example where it would
break?

One alternative is to do timekeeping with delay() in all cases, to
decouple from sched/clock.
Perform an exclusive load, which atomically loads a word and arms the
exclusive monitor to enable wfet()/wfe() accelerated polling.
https://developer.arm.com/documentation/dht0008/a/arm-synchronization-primitives/exclusive-accesses/exclusive-monitors
Signed-off-by: Haris Okanovic <harisokn@amazon.com>
---
arch/arm64/include/asm/readex.h | 46 +++++++++++++++++++++++++++++++++
1 file changed, 46 insertions(+)
create mode 100644 arch/arm64/include/asm/readex.h
diff --git a/arch/arm64/include/asm/readex.h b/arch/arm64/include/asm/readex.h
new file mode 100644
index 000000000000..51963c3107e1
--- /dev/null
+++ b/arch/arm64/include/asm/readex.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Based on arch/arm64/include/asm/rwonce.h
+ *
+ * Copyright (C) 2020 Google LLC.
+ * Copyright (C) 2024 Amazon.com, Inc. or its affiliates.
+ */
+
+#ifndef __ASM_READEX_H
+#define __ASM_READEX_H
+
+#define __LOAD_EX(sfx, regs...) "ldaxr" #sfx "\t" #regs
+
+#define __READ_ONCE_EX(x) \
+({ \
+ typeof(&(x)) __x = &(x); \
+ int atomic = 1; \
+ union { __unqual_scalar_typeof(*__x) __val; char __c[1]; } __u; \
+ switch (sizeof(x)) { \
+ case 1: \
+ asm volatile(__LOAD_EX(b, %w0, %1) \
+ : "=r" (*(__u8 *)__u.__c) \
+ : "Q" (*__x) : "memory"); \
+ break; \
+ case 2: \
+ asm volatile(__LOAD_EX(h, %w0, %1) \
+ : "=r" (*(__u16 *)__u.__c) \
+ : "Q" (*__x) : "memory"); \
+ break; \
+ case 4: \
+ asm volatile(__LOAD_EX(, %w0, %1) \
+ : "=r" (*(__u32 *)__u.__c) \
+ : "Q" (*__x) : "memory"); \
+ break; \
+ case 8: \
+ asm volatile(__LOAD_EX(, %0, %1) \
+ : "=r" (*(__u64 *)__u.__c) \
+ : "Q" (*__x) : "memory"); \
+ break; \
+ default: \
+ atomic = 0; \
+ } \
+ atomic ? (typeof(*__x))__u.__val : (*(volatile typeof(__x))__x);\
+})
+
+#endif
--
2.34.1
From: Haris Okanovic
> Sent: 05 November 2024 18:31
>
> Perform an exclusive load, which atomically loads a word and arms the
> exclusive monitor to enable wfet()/wfe() accelerated polling.
...
> +	atomic ? (typeof(*__x))__u.__val : (*(volatile typeof(__x))__x);\

That doesn't do what you want it to do.
(It is wrong in READ_ONCE() as well.)

?: is treated like an arithmetic operator and the result will get
promoted to 'int'.

Moving the first cast outside the ?: probably works:
	(typeof(*__x))(atomic ? __u.__val : (*(volatile typeof(__x))__x));

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Tue, Nov 05, 2024 at 12:30:38PM -0600, Haris Okanovic wrote:
> Perform an exclusive load, which atomically loads a word and arms the
> exclusive monitor to enable wfet()/wfe() accelerated polling.
>
> https://developer.arm.com/documentation/dht0008/a/arm-synchronization-primitives/exclusive-accesses/exclusive-monitors
>
> Signed-off-by: Haris Okanovic <harisokn@amazon.com>
> ---
> arch/arm64/include/asm/readex.h | 46 +++++++++++++++++++++++++++++++++
> 1 file changed, 46 insertions(+)
> create mode 100644 arch/arm64/include/asm/readex.h
>
> diff --git a/arch/arm64/include/asm/readex.h b/arch/arm64/include/asm/readex.h
> new file mode 100644
> index 000000000000..51963c3107e1
> --- /dev/null
> +++ b/arch/arm64/include/asm/readex.h
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Based on arch/arm64/include/asm/rwonce.h
> + *
> + * Copyright (C) 2020 Google LLC.
> + * Copyright (C) 2024 Amazon.com, Inc. or its affiliates.
> + */
> +
> +#ifndef __ASM_READEX_H
> +#define __ASM_READEX_H
> +
> +#define __LOAD_EX(sfx, regs...) "ldaxr" #sfx "\t" #regs
> +
> +#define __READ_ONCE_EX(x) \
> +({ \
> + typeof(&(x)) __x = &(x); \
> + int atomic = 1; \
> + union { __unqual_scalar_typeof(*__x) __val; char __c[1]; } __u; \
> + switch (sizeof(x)) { \
> + case 1: \
> + asm volatile(__LOAD_EX(b, %w0, %1) \
> + : "=r" (*(__u8 *)__u.__c) \
> + : "Q" (*__x) : "memory"); \
> + break; \
> + case 2: \
> + asm volatile(__LOAD_EX(h, %w0, %1) \
> + : "=r" (*(__u16 *)__u.__c) \
> + : "Q" (*__x) : "memory"); \
> + break; \
> + case 4: \
> + asm volatile(__LOAD_EX(, %w0, %1) \
> + : "=r" (*(__u32 *)__u.__c) \
> + : "Q" (*__x) : "memory"); \
> + break; \
> + case 8: \
> + asm volatile(__LOAD_EX(, %0, %1) \
> + : "=r" (*(__u64 *)__u.__c) \
> + : "Q" (*__x) : "memory"); \
> + break; \
> + default: \
> + atomic = 0; \
> + } \
> + atomic ? (typeof(*__x))__u.__val : (*(volatile typeof(__x))__x);\
> +})
I think this is a bad idea. Load-exclusive needs to be used very carefully,
preferably when you're able to see exactly what instructions it's
interacting with. By making this into a macro, we're at the mercy of the
compiler and we give the wrong impression that you could e.g. build atomic
critical sections out of this macro.
So I'm fairly strongly against this interface.
Will
On Wed, 2024-11-06 at 11:43 +0000, Will Deacon wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>
>
>
> On Tue, Nov 05, 2024 at 12:30:38PM -0600, Haris Okanovic wrote:
> > Perform an exclusive load, which atomically loads a word and arms the
> > exclusive monitor to enable wfet()/wfe() accelerated polling.
> >
> > https://developer.arm.com/documentation/dht0008/a/arm-synchronization-primitives/exclusive-accesses/exclusive-monitors
> >
> > Signed-off-by: Haris Okanovic <harisokn@amazon.com>
> > ---
> > arch/arm64/include/asm/readex.h | 46 +++++++++++++++++++++++++++++++++
> > 1 file changed, 46 insertions(+)
> > create mode 100644 arch/arm64/include/asm/readex.h
> >
> > diff --git a/arch/arm64/include/asm/readex.h b/arch/arm64/include/asm/readex.h
> > new file mode 100644
> > index 000000000000..51963c3107e1
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/readex.h
> > @@ -0,0 +1,46 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Based on arch/arm64/include/asm/rwonce.h
> > + *
> > + * Copyright (C) 2020 Google LLC.
> > + * Copyright (C) 2024 Amazon.com, Inc. or its affiliates.
> > + */
> > +
> > +#ifndef __ASM_READEX_H
> > +#define __ASM_READEX_H
> > +
> > +#define __LOAD_EX(sfx, regs...) "ldaxr" #sfx "\t" #regs
> > +
> > +#define __READ_ONCE_EX(x) \
> > +({ \
> > + typeof(&(x)) __x = &(x); \
> > + int atomic = 1; \
> > + union { __unqual_scalar_typeof(*__x) __val; char __c[1]; } __u; \
> > + switch (sizeof(x)) { \
> > + case 1: \
> > + asm volatile(__LOAD_EX(b, %w0, %1) \
> > + : "=r" (*(__u8 *)__u.__c) \
> > + : "Q" (*__x) : "memory"); \
> > + break; \
> > + case 2: \
> > + asm volatile(__LOAD_EX(h, %w0, %1) \
> > + : "=r" (*(__u16 *)__u.__c) \
> > + : "Q" (*__x) : "memory"); \
> > + break; \
> > + case 4: \
> > + asm volatile(__LOAD_EX(, %w0, %1) \
> > + : "=r" (*(__u32 *)__u.__c) \
> > + : "Q" (*__x) : "memory"); \
> > + break; \
> > + case 8: \
> > + asm volatile(__LOAD_EX(, %0, %1) \
> > + : "=r" (*(__u64 *)__u.__c) \
> > + : "Q" (*__x) : "memory"); \
> > + break; \
> > + default: \
> > + atomic = 0; \
> > + } \
> > + atomic ? (typeof(*__x))__u.__val : (*(volatile typeof(__x))__x);\
> > +})
>
> I think this is a bad idea. Load-exclusive needs to be used very carefully,
> preferably when you're able to see exactly what instructions it's
> interacting with. By making this into a macro, we're at the mercy of the
> compiler and we give the wrong impression that you could e.g. build atomic
> critical sections out of this macro.
>
> So I'm fairly strongly against this interface.
Fair point. I'll post an alternate delay() implementation in asm. It's
a simple routine.
>
> Will
On Tue, 5 Nov 2024, Haris Okanovic wrote:
> +#define __READ_ONCE_EX(x) \
This is derived from READ_ONCE and named similarly so maybe it would
better be placed into rwonce.h?
On Tue, 2024-11-05 at 11:39 -0800, Christoph Lameter (Ampere) wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>
> On Tue, 5 Nov 2024, Haris Okanovic wrote:
>
> > +#define __READ_ONCE_EX(x) \
>
> This is derived from READ_ONCE and named similarly so maybe it would
> better be placed into rwonce.h?
I plan to remove this macro per other feedback in this thread.
Refactor arm64's delay() to poll for a mask/value condition (vcond) in
its wfet(), wfe(), and relaxed polling loops.
Signed-off-by: Haris Okanovic <harisokn@amazon.com>
---
arch/arm64/lib/delay.c | 70 ++++++++++++++++++++++++++++++------------
1 file changed, 50 insertions(+), 20 deletions(-)
diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c
index cb2062e7e234..a7c3040af316 100644
--- a/arch/arm64/lib/delay.c
+++ b/arch/arm64/lib/delay.c
@@ -14,43 +14,73 @@
#include <linux/timex.h>
#include <clocksource/arm_arch_timer.h>
+#include <asm/readex.h>
-#define USECS_TO_CYCLES(time_usecs) \
- xloops_to_cycles((time_usecs) * 0x10C7UL)
-
-static inline unsigned long xloops_to_cycles(unsigned long xloops)
+static inline u64 xloops_to_cycles(u64 xloops)
{
return (xloops * loops_per_jiffy * HZ) >> 32;
}
-void __delay(unsigned long cycles)
+#define USECS_TO_XLOOPS(time_usecs) \
+ ((time_usecs) * 0x10C7UL)
+
+#define USECS_TO_CYCLES(time_usecs) \
+ xloops_to_cycles(USECS_TO_XLOOPS(time_usecs))
+
+#define NSECS_TO_XLOOPS(time_nsecs) \
+ ((time_nsecs) * 0x10C7UL)
+
+#define NSECS_TO_CYCLES(time_nsecs) \
+ xloops_to_cycles(NSECS_TO_XLOOPS(time_nsecs))
+
+static unsigned long __delay_until_ul(u64 cycles, unsigned long* addr, unsigned long mask, unsigned long val)
{
- cycles_t start = get_cycles();
+ u64 start = get_cycles();
+ unsigned long cur;
if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) {
u64 end = start + cycles;
- /*
- * Start with WFIT. If an interrupt makes us resume
- * early, use a WFET loop to complete the delay.
- */
- wfit(end);
- while ((get_cycles() - start) < cycles)
+ do {
+ cur = __READ_ONCE_EX(*addr);
+ if ((cur & mask) == val) {
+ break;
+ }
wfet(end);
- } else if (arch_timer_evtstrm_available()) {
- const cycles_t timer_evt_period =
+ } while ((get_cycles() - start) < cycles);
+ } else if (arch_timer_evtstrm_available()) {
+ const u64 timer_evt_period =
USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US);
- while ((get_cycles() - start + timer_evt_period) < cycles)
+ do {
+ cur = __READ_ONCE_EX(*addr);
+ if ((cur & mask) == val) {
+ break;
+ }
wfe();
+ } while ((get_cycles() - start + timer_evt_period) < cycles);
+ } else {
+ do {
+ cur = __READ_ONCE_EX(*addr);
+ if ((cur & mask) == val) {
+ break;
+ }
+ cpu_relax();
+ } while ((get_cycles() - start) < cycles);
}
- while ((get_cycles() - start) < cycles)
- cpu_relax();
+ return cur;
+}
+
+void __delay(unsigned long cycles)
+{
+ /* constant word for wfet()/wfe() to poll */
+ unsigned long dummy ____cacheline_aligned = 0;
+ __delay_until_ul(cycles, &dummy, 0, 1);
}
EXPORT_SYMBOL(__delay);
-inline void __const_udelay(unsigned long xloops)
+void __const_udelay(unsigned long xloops)
{
__delay(xloops_to_cycles(xloops));
}
@@ -58,12 +88,12 @@ EXPORT_SYMBOL(__const_udelay);
void __udelay(unsigned long usecs)
{
- __const_udelay(usecs * 0x10C7UL); /* 2**32 / 1000000 (rounded up) */
+ __delay(USECS_TO_CYCLES(usecs));
}
EXPORT_SYMBOL(__udelay);
void __ndelay(unsigned long nsecs)
{
- __const_udelay(nsecs * 0x5UL); /* 2**32 / 1000000000 (rounded up) */
+ __delay(NSECS_TO_CYCLES(nsecs));
}
EXPORT_SYMBOL(__ndelay);
--
2.34.1
On Tue, Nov 05, 2024 at 12:30:39PM -0600, Haris Okanovic wrote:
> + do {
> + cur = __READ_ONCE_EX(*addr);
> + if ((cur & mask) == val) {
> + break;
> + }
> wfet(end);
Constructs like this need to be entirely in assembly. The compiler may
spill 'cur' onto the stack and the write could clear the exclusive
monitor which makes the wfet return immediately. That's highly CPU
implementation specific but it's the reason we have functions like
__cmpwait() in assembly (or whatever else deals with exclusives). IOW,
we can't have other memory accesses between the LDXR and whatever is
consuming the exclusive monitor armed state (typically STXR but WFE/WFET
can be another).
--
Catalin
On Wed, 2024-11-06 at 09:18 +0000, Catalin Marinas wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>
>
>
> On Tue, Nov 05, 2024 at 12:30:39PM -0600, Haris Okanovic wrote:
> > + do {
> > + cur = __READ_ONCE_EX(*addr);
> > + if ((cur & mask) == val) {
> > + break;
> > + }
> > wfet(end);
>
> Constructs like this need to be entirely in assembly. The compiler may
> spill 'cur' onto the stack and the write could clear the exclusive
> monitor which makes the wfet return immediately. That's highly CPU
> implementation specific but it's the reason we have functions like
> __cmpwait() in assembly (or whatever else deals with exclusives). IOW,
> we can't have other memory accesses between the LDXR and whatever is
> consuming the exclusive monitor armed state (typically STXR but WFE/WFET
> can be another).
Agreed, will rewrite parts of delay() in asm.
>
> --
> Catalin
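A minimal sketch of the self-contained assembly Catalin describes, loosely modeled on the kernel's existing __cmpwait() helpers (the name and exact sequence here are illustrative assumptions, not the code Haris later posted): the LDXR that arms the exclusive monitor and the WFE that consumes the armed state sit in a single asm block, so the compiler cannot insert spills or other stores in between.
#include <linux/types.h>

/* Sketch only: wait until *ptr differs from 'val' or an event arrives. */
static inline void cmpwait_relaxed_u64(u64 *ptr, u64 val)
{
	u64 tmp;

	asm volatile(
	"	ldxr	%[tmp], %[v]\n"		/* load and arm the exclusive monitor */
	"	eor	%[tmp], %[tmp], %[val]\n"
	"	cbnz	%[tmp], 1f\n"		/* value already changed: skip the wait */
	"	wfe\n"				/* wake on monitor clear or event stream */
	"1:"
	: [tmp] "=&r" (tmp), [v] "+Q" (*ptr)
	: [val] "r" (val));
}
Keeping the whole sequence in one asm block is the same constraint that leads Haris, above, to agree to rewrite parts of delay() in assembly.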
On Tue, 5 Nov 2024, Haris Okanovic wrote:
> -#define USECS_TO_CYCLES(time_usecs) \
> - xloops_to_cycles((time_usecs) * 0x10C7UL)
> -
> -static inline unsigned long xloops_to_cycles(unsigned long xloops)
> +static inline u64 xloops_to_cycles(u64 xloops)
> {
> return (xloops * loops_per_jiffy * HZ) >> 32;
> }
>
> -void __delay(unsigned long cycles)
> +#define USECS_TO_XLOOPS(time_usecs) \
> + ((time_usecs) * 0x10C7UL)
> +
> +#define USECS_TO_CYCLES(time_usecs) \
> + xloops_to_cycles(USECS_TO_XLOOPS(time_usecs))
> +
> +#define NSECS_TO_XLOOPS(time_nsecs) \
> + ((time_nsecs) * 0x10C7UL)
The constant here is the same value as for microseconds. If I remember
correctly it's 5UL for nanoseconds.
On Tue, 2024-11-05 at 11:42 -0800, Christoph Lameter (Ampere) wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>
>
>
> On Tue, 5 Nov 2024, Haris Okanovic wrote:
>
> > -#define USECS_TO_CYCLES(time_usecs) \
> > - xloops_to_cycles((time_usecs) * 0x10C7UL)
> > -
> > -static inline unsigned long xloops_to_cycles(unsigned long xloops)
> > +static inline u64 xloops_to_cycles(u64 xloops)
> > {
> > return (xloops * loops_per_jiffy * HZ) >> 32;
> > }
> >
> > -void __delay(unsigned long cycles)
> > +#define USECS_TO_XLOOPS(time_usecs) \
> > + ((time_usecs) * 0x10C7UL)
> > +
> > +#define USECS_TO_CYCLES(time_usecs) \
> > + xloops_to_cycles(USECS_TO_XLOOPS(time_usecs))
> > +
>
>
> > +#define NSECS_TO_XLOOPS(time_nsecs) \
> > + ((time_nsecs) * 0x10C7UL)
>
> The constant here is the same value as for microseconds. If I remember
> correctly it's 5UL for nanoseconds.
>
You're right, good catch. Should be `nsecs * 0x5UL` per old code.
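For reference, both constants follow from the "2**32 / rate, rounded up" comments in the old code:
	2^32 / 10^6 = 4294.97 -> 0x10C7 (4295) for microseconds
	2^32 / 10^9 = 4.29    -> 0x5           for nanoseconds
so reusing 0x10C7UL in NSECS_TO_XLOOPS() would make __ndelay() wait roughly a thousand times longer than requested.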
Implement smp_vcond_load_relaxed() atop __delay_until_ul() on arm64,
to reduce the number of busy loops while waiting for a value condition.
This implementation only supports unsigned long words. It can be extended
to other sizes via the switch statement in barrier.h as needed.
Signed-off-by: Haris Okanovic <harisokn@amazon.com>
---
arch/arm64/include/asm/barrier.h | 18 ++++++++++++++++++
arch/arm64/lib/delay.c | 16 ++++++++++++++++
2 files changed, 34 insertions(+)
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 1ca947d5c939..188327e3ce72 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -203,6 +203,24 @@ do { \
(typeof(*ptr))VAL; \
})
+extern unsigned long __smp_vcond_load_relaxed_ul(
+ u64 nsecs, unsigned long* addr, unsigned long mask, unsigned long val);
+
+#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({ \
+ u64 __nsecs = (nsecs); \
+ typeof(addr) __addr = (addr); \
+ typeof(*__addr) __mask = (mask); \
+ typeof(*__addr) __val = (val); \
+ typeof(*__addr) __cur; \
+ switch (sizeof(*__addr)) { \
+ case sizeof(unsigned long): \
+ __cur = __smp_vcond_load_relaxed_ul( \
+ __nsecs, __addr, __mask, __val); \
+ break; \
+ } \
+ (__cur); \
+})
+
#define smp_cond_load_acquire(ptr, cond_expr) \
({ \
typeof(ptr) __PTR = (ptr); \
diff --git a/arch/arm64/lib/delay.c b/arch/arm64/lib/delay.c
index a7c3040af316..a61a13b04439 100644
--- a/arch/arm64/lib/delay.c
+++ b/arch/arm64/lib/delay.c
@@ -12,6 +12,7 @@
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/timex.h>
+#include <linux/sched/clock.h>
#include <clocksource/arm_arch_timer.h>
#include <asm/readex.h>
@@ -97,3 +98,18 @@ void __ndelay(unsigned long nsecs)
__delay(NSECS_TO_CYCLES(nsecs));
}
EXPORT_SYMBOL(__ndelay);
+
+unsigned long __smp_vcond_load_relaxed_ul(
+ u64 nsecs, unsigned long* addr, unsigned long mask, unsigned long val)
+{
+ const u64 start = local_clock_noinstr();
+ const u64 cycles = NSECS_TO_CYCLES(nsecs);
+ unsigned long cur;
+
+ do {
+ cur = __delay_until_ul(cycles, addr, mask, val);
+ } while((cur & mask) != val && local_clock_noinstr() - start < nsecs);
+
+ return cur;
+}
+EXPORT_SYMBOL(__smp_vcond_load_relaxed_ul);
--
2.34.1
Implement poll_idle() using the smp_vcond_load_relaxed() function.
Signed-off-by: Haris Okanovic <harisokn@amazon.com>
---
drivers/cpuidle/poll_state.c | 36 +++++-------------------------------
1 file changed, 5 insertions(+), 31 deletions(-)
diff --git a/drivers/cpuidle/poll_state.c b/drivers/cpuidle/poll_state.c
index 61df2395585e..5553e6f31702 100644
--- a/drivers/cpuidle/poll_state.c
+++ b/drivers/cpuidle/poll_state.c
@@ -7,46 +7,20 @@
#include <linux/sched.h>
#include <linux/sched/clock.h>
#include <linux/sched/idle.h>
-
-#ifdef CONFIG_ARM64
-/*
- * POLL_IDLE_RELAX_COUNT determines how often we check for timeout
- * while polling for TIF_NEED_RESCHED in thread_info->flags.
- *
- * Set this to a low value since arm64, instead of polling, uses a
- * event based mechanism.
- */
-#define POLL_IDLE_RELAX_COUNT 1
-#else
-#define POLL_IDLE_RELAX_COUNT 200
-#endif
+#include <asm/barrier.h>
static int __cpuidle poll_idle(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int index)
{
- u64 time_start;
-
- time_start = local_clock_noinstr();
+ unsigned long flags;
dev->poll_time_limit = false;
raw_local_irq_enable();
if (!current_set_polling_and_test()) {
- u64 limit;
-
- limit = cpuidle_poll_time(drv, dev);
-
- while (!need_resched()) {
- unsigned int loop_count = 0;
- if (local_clock_noinstr() - time_start > limit) {
- dev->poll_time_limit = true;
- break;
- }
-
- smp_cond_load_relaxed(&current_thread_info()->flags,
- VAL & _TIF_NEED_RESCHED ||
- loop_count++ >= POLL_IDLE_RELAX_COUNT);
- }
+ u64 limit = cpuidle_poll_time(drv, dev);
+ flags = smp_vcond_load_relaxed(limit, &current_thread_info()->flags, _TIF_NEED_RESCHED, _TIF_NEED_RESCHED);
+ dev->poll_time_limit = !(flags & _TIF_NEED_RESCHED);
}
raw_local_irq_disable();
--
2.34.1
On Tue, 5 Nov 2024, Haris Okanovic wrote:
> {
> - u64 time_start;
> -
> - time_start = local_clock_noinstr();
> + unsigned long flags;
Let's keep recording that start value here. Otherwise the timeout could be
later than expected.
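For illustration, one hypothetical way to keep that start value (a sketch of Christoph's suggestion, not the posted patch): sample the clock on entry and charge the setup work against the poll budget before calling smp_vcond_load_relaxed().
/* Sketch only: poll_idle() with the start time recorded on entry. */
static int __cpuidle poll_idle(struct cpuidle_device *dev,
			       struct cpuidle_driver *drv, int index)
{
	u64 time_start = local_clock_noinstr();
	unsigned long flags;

	dev->poll_time_limit = false;

	raw_local_irq_enable();
	if (!current_set_polling_and_test()) {
		u64 limit = cpuidle_poll_time(drv, dev);
		u64 elapsed = local_clock_noinstr() - time_start;

		if (elapsed < limit) {
			/* hand only the remaining budget to the wait helper */
			flags = smp_vcond_load_relaxed(limit - elapsed,
					&current_thread_info()->flags,
					_TIF_NEED_RESCHED,
					_TIF_NEED_RESCHED);
			dev->poll_time_limit = !(flags & _TIF_NEED_RESCHED);
		} else {
			dev->poll_time_limit = true;
		}
	}
	raw_local_irq_disable();

	current_clr_polling();

	return index;
}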