This patchset adds support for polling in idle via poll_idle() on
arm64.
There are two main changes in this version:
1. rework the series to take Catalin Marinas' comments on the semantics
   of smp_cond_load_relaxed() (and how earlier versions of this
   series were abusing them) into account.
   This also allows us to drop the somewhat strained connection
   between haltpoll and the event-stream.
2. earlier versions of this series added support for poll_idle() but
   used it only in the haltpoll driver. Add Lifeng's patch to broaden
   this out by also polling in acpi-idle.
The benefit of polling in idle is to reduce the cost of remote wakeups.
With polling enabled, a wakeup can be done just by setting the
need-resched bit, instead of sending an IPI and incurring the cost of
handling the interrupt on the receiver side. When running in a VM it
also saves the cost of WFE trapping (where enabled).
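As a rough userspace analogy (not kernel code and not part of this
series; all names below are made up): the waiter spins on a flag, so
the waker only needs a plain store, standing in for setting the
need-resched bit, rather than an IPI-like notification. In the kernel
the spinning side is poll_idle().

  /* Illustrative analogy only: polling waiter, store-only waker. */
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  static atomic_int need_resched;   /* stand-in for the need-resched bit */

  static void *poller(void *arg)
  {
      /* poll_idle()-style loop: spin until the waker sets the flag */
      while (!atomic_load_explicit(&need_resched, memory_order_relaxed))
          ;                         /* cpu_relax() would go here */
      printf("woken without an interrupt\n");
      return NULL;
  }

  int main(void)
  {
      pthread_t t;

      pthread_create(&t, NULL, poller, NULL);
      /* "remote wakeup": just a store, no IPI equivalent needed */
      atomic_store_explicit(&need_resched, 1, memory_order_relaxed);
      pthread_join(&t, NULL);
      return 0;
  }

The sched:sched_wake_idle_without_ipi counts in the numbers below are
exactly such store-only wakeups.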
Comparing sched-pipe performance on a guest VM:
# perf stat -r 5 --cpu 4,5 -e task-clock,cycles,instructions,sched:sched_wake_idle_without_ipi \
perf bench sched pipe -l 1000000 -c 4
# no polling in idle
Performance counter stats for 'CPU(s) 4,5' (5 runs):
25,229.57 msec task-clock # 2.000 CPUs utilized ( +- 7.75% )
45,821,250,284 cycles # 1.816 GHz ( +- 10.07% )
26,557,496,665 instructions # 0.58 insn per cycle ( +- 0.21% )
0 sched:sched_wake_idle_without_ipi # 0.000 /sec
12.615 +- 0.977 seconds time elapsed ( +- 7.75% )
# polling in idle (with haltpoll):
Performance counter stats for 'CPU(s) 4,5' (5 runs):
15,131.58 msec task-clock # 2.000 CPUs utilized ( +- 10.00% )
34,158,188,839 cycles # 2.257 GHz ( +- 6.91% )
20,824,950,916 instructions # 0.61 insn per cycle ( +- 0.09% )
1,983,822 sched:sched_wake_idle_without_ipi # 131.105 K/sec ( +- 0.78% )
7.566 +- 0.756 seconds time elapsed ( +- 10.00% )
Tomohiro Misono and Haris Okanovic also report similar latency
improvements on Grace and Graviton systems (for v7) [1] [2].
Lifeng also reports improved context switch latency on a bare-metal
machine with acpi-idle [3].
The series is in four parts:
- patches 1-4,
"asm-generic: add barrier smp_cond_load_relaxed_timeout()"
"cpuidle/poll_state: poll via smp_cond_load_relaxed_timeout()"
"cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL"
"Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig"
add smp_cond_load_relaxed_timeout() and switch poll_idle() to using
it (a rough sketch of this polling pattern follows the series
overview below). Also, do some munging of related kconfig options.
- patches 5-7,
"arm64: barrier: add support for smp_cond_relaxed_timeout()"
"arm64: define TIF_POLLING_NRFLAG"
"arm64: add support for polling in idle"
add support for the new barrier, the polling flag and enable
poll_idle() support.
- patches 8, 9-13,
"ACPI: processor_idle: Support polling state for LPI"
"cpuidle-haltpoll: define arch_haltpoll_want()"
"governors/haltpoll: drop kvm_para_available() check"
"cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL"
"arm64: idle: export arch_cpu_idle"
"arm64: support cpuidle-haltpoll"
add support for polling via acpi-idle, and cpuidle-haltpoll.
- patches 14, 15,
"arm64/delay: move some constants out to a separate header"
"arm64: support WFET in smp_cond_relaxed_timeout()"
are RFC patches to enable WFET support.
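For reference, here is a rough sketch of the timeout-polling pattern
that patch 1 is about. This is not the patch itself; the actual
smp_cond_load_relaxed_timeout() interface and semantics are defined
there, and cond_load_timeout(), SPIN_COUNT and now_ns() below are
made-up names. The idea is to spin with relaxed loads and only recheck
the clock every few iterations, so that the common path stays cheap:

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <time.h>

  #define SPIN_COUNT 200   /* in the spirit of POLL_IDLE_RELAX_COUNT */

  static unsigned long long now_ns(void)
  {
      struct timespec ts;

      clock_gettime(CLOCK_MONOTONIC, &ts);
      return (unsigned long long)ts.tv_sec * 1000000000ull + ts.tv_nsec;
  }

  /* Wait until *flag becomes non-zero or timeout_ns expires. */
  static bool cond_load_timeout(atomic_int *flag,
                                unsigned long long timeout_ns)
  {
      unsigned long long deadline = now_ns() + timeout_ns;
      unsigned int spins = 0;

      for (;;) {
          if (atomic_load_explicit(flag, memory_order_relaxed))
              return true;          /* condition met */
          if (++spins == SPIN_COUNT) {
              spins = 0;
              if (now_ns() >= deadline)
                  return false;     /* timed out */
          }
      }
  }

The RFC patches 14-15 are aimed at letting this kind of wait use WFET
on arm64 instead of pure spinning, so the CPU does not have to burn
cycles until the deadline.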
Changelog:
v9:
- reworked the series to address a comment from Catalin Marinas
  about how v8 was abusing the semantics of smp_cond_load_relaxed().
- added poll_idle() support in acpi-idle (Lifeng Zheng)
- dropped some earlier "Tested-by" and "Reviewed-by" tags due to the
  above rework.
v8: No logic changes. Largely respin of v7, with changes
noted below:
- move selection of ARCH_HAS_OPTIMIZED_POLL on arm64 to its
own patch.
(patch-9 "arm64: select ARCH_HAS_OPTIMIZED_POLL")
- address comments simplifying arm64 support (Will Deacon)
(patch-11 "arm64: support cpuidle-haltpoll")
v7: No significant logic changes. Mostly a respin of v6.
- minor cleanup in poll_idle() (Christoph Lameter)
- fixes conflicts due to code movement in arch/arm64/kernel/cpuidle.c
(Tomohiro Misono)
v6:
- reordered the patches to keep poll_idle() and ARCH_HAS_OPTIMIZED_POLL
changes together (comment from Christoph Lameter)
- fleshes out the commit messages a bit more (comments from Christoph
  Lameter, Sudeep Holla)
- also rework selection of cpuidle-haltpoll. Now selected based
on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
- moved back to arch_haltpoll_want() (comment from Joao Martins)
Also, arch_haltpoll_want() now takes the force parameter and is
now responsible for the complete selection (or not) of haltpoll.
- fixes the build breakage on i386
- fixes the cpuidle-haltpoll module breakage on arm64 (comment from
Tomohiro Misono, Haris Okanovic)
v5:
- rework the poll_idle() loop around smp_cond_load_relaxed() (review
comment from Tomohiro Misono.)
- also rework selection of cpuidle-haltpoll. Now selected based
on the architectural selection of ARCH_CPUIDLE_HALTPOLL.
- arch_haltpoll_supported() (renamed from arch_haltpoll_want()) on
arm64 now depends on the event-stream being enabled.
- limit POLL_IDLE_RELAX_COUNT on arm64 (review comment from Haris Okanovic)
- ARCH_HAS_CPU_RELAX is now renamed to ARCH_HAS_OPTIMIZED_POLL.
v4 changes from v3:
- change 7/8 per Rafael's input: drop the parens and use ret for the final check
- add 8/8 which renames the guard for building poll_state
v3 changes from v2:
- fix 1/7 per Petr Mladek - remove ARCH_HAS_CPU_RELAX from arch/x86/Kconfig
- add Acked-by from Rafael Wysocki on 2/7
v2 changes from v1:
- added patch 7 where we replace cpu_relax() with smp_cond_load_relaxed() per PeterZ
  (this reduces the CPU cycles consumed in the tests above by about 26%:
  10,716,881,137 now vs 14,503,014,257 before)
- removed the ifdef from patch 1 per RafaelW
Please review.
[1] https://lore.kernel.org/lkml/TY3PR01MB111481E9B0AF263ACC8EA5D4AE5BA2@TY3PR01MB11148.jpnprd01.prod.outlook.com/
[2] https://lore.kernel.org/lkml/104d0ec31cb45477e27273e089402d4205ee4042.camel@amazon.com/
[3] https://lore.kernel.org/lkml/f8a1f85b-c4bf-4c38-81bf-728f72a4f2fe@huawei.com/
Ankur Arora (10):
asm-generic: add barrier smp_cond_load_relaxed_timeout()
cpuidle/poll_state: poll via smp_cond_load_relaxed_timeout()
cpuidle: rename ARCH_HAS_CPU_RELAX to ARCH_HAS_OPTIMIZED_POLL
arm64: barrier: add support for smp_cond_load_relaxed_timeout()
arm64: add support for polling in idle
cpuidle-haltpoll: condition on ARCH_CPUIDLE_HALTPOLL
arm64: idle: export arch_cpu_idle
arm64: support cpuidle-haltpoll
arm64/delay: move some constants out to a separate header
arm64: support WFET in smp_cond_load_relaxed_timeout()
Joao Martins (4):
Kconfig: move ARCH_HAS_OPTIMIZED_POLL to arch/Kconfig
arm64: define TIF_POLLING_NRFLAG
cpuidle-haltpoll: define arch_haltpoll_want()
governors/haltpoll: drop kvm_para_available() check
Lifeng Zheng (1):
ACPI: processor_idle: Support polling state for LPI
arch/Kconfig | 3 ++
arch/arm64/Kconfig | 7 +++
arch/arm64/include/asm/barrier.h | 62 ++++++++++++++++++++++-
arch/arm64/include/asm/cmpxchg.h | 26 ++++++----
arch/arm64/include/asm/cpuidle_haltpoll.h | 20 ++++++++
arch/arm64/include/asm/delay-const.h | 25 +++++++++
arch/arm64/include/asm/thread_info.h | 2 +
arch/arm64/kernel/idle.c | 1 +
arch/arm64/lib/delay.c | 13 ++---
arch/x86/Kconfig | 5 +-
arch/x86/include/asm/cpuidle_haltpoll.h | 1 +
arch/x86/kernel/kvm.c | 13 +++++
drivers/acpi/processor_idle.c | 43 +++++++++++++---
drivers/cpuidle/Kconfig | 5 +-
drivers/cpuidle/Makefile | 2 +-
drivers/cpuidle/cpuidle-haltpoll.c | 12 +----
drivers/cpuidle/governors/haltpoll.c | 6 +--
drivers/cpuidle/poll_state.c | 27 +++-------
drivers/idle/Kconfig | 1 +
include/asm-generic/barrier.h | 42 +++++++++++++++
include/linux/cpuidle.h | 2 +-
include/linux/cpuidle_haltpoll.h | 5 ++
22 files changed, 252 insertions(+), 71 deletions(-)
create mode 100644 arch/arm64/include/asm/cpuidle_haltpoll.h
create mode 100644 arch/arm64/include/asm/delay-const.h
--
2.43.5