[PATCH v10 10/11] arm64: idle: export arch_cpu_idle()

Ankur Arora posted 11 patches 10 months ago
Posted by Ankur Arora 10 months ago
Needed for cpuidle-haltpoll.

Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/arm64/kernel/idle.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
index 05cfb347ec26..b85ba0df9b02 100644
--- a/arch/arm64/kernel/idle.c
+++ b/arch/arm64/kernel/idle.c
@@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
 	 */
 	cpu_do_idle();
 }
+EXPORT_SYMBOL_GPL(arch_cpu_idle);
-- 
2.43.5
Re: [PATCH v10 10/11] arm64: idle: export arch_cpu_idle()
Posted by Shuai Xue 8 months, 1 week ago

On 2025/2/19 05:33, Ankur Arora wrote:
> Needed for cpuidle-haltpoll.
> 
> Acked-by: Will Deacon <will@kernel.org>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> ---
>   arch/arm64/kernel/idle.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
> index 05cfb347ec26..b85ba0df9b02 100644
> --- a/arch/arm64/kernel/idle.c
> +++ b/arch/arm64/kernel/idle.c
> @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
>   	 */
>   	cpu_do_idle();

Hi, Ankur,

With haltpoll_driver registered, arch_cpu_idle() on x86 can select
mwait_idle() in idle threads.

MONITOR sets up an effective address range that is monitored
for write-to-memory activities; MWAIT places the processor in
an optimized state (this may vary between different implementations)
until a write to the monitored address range occurs.

Should arch_cpu_idle() on arm64 also use the LDXR/WFE
to avoid wakeup IPI like x86 monitor/mwait?

Thanks.
Shuai


Re: [PATCH v10 10/11] arm64: idle: export arch_cpu_idle()
Posted by Ankur Arora 8 months, 1 week ago
Shuai Xue <xueshuai@linux.alibaba.com> writes:

> On 2025/2/19 05:33, Ankur Arora wrote:
>> Needed for cpuidle-haltpoll.
>> Acked-by: Will Deacon <will@kernel.org>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> ---
>>   arch/arm64/kernel/idle.c | 1 +
>>   1 file changed, 1 insertion(+)
>> diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
>> index 05cfb347ec26..b85ba0df9b02 100644
>> --- a/arch/arm64/kernel/idle.c
>> +++ b/arch/arm64/kernel/idle.c
>> @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
>>   	 */
>>   	cpu_do_idle();
>
> Hi, Ankur,
>
> With haltpoll_driver registered, arch_cpu_idle() on x86 can select
> mwait_idle() in idle threads.
>
> MONITOR sets up an effective address range that is monitored
> for write-to-memory activities; MWAIT places the processor in
> an optimized state (this may vary between different implementations)
> until a write to the monitored address range occurs.

MWAIT is more capable than WFE -- it allows selection of deeper idle
states (C2/C3, IIRC).

> Should arch_cpu_idle() on arm64 also use the LDXR/WFE
> to avoid wakeup IPI like x86 monitor/mwait?

Avoiding the wakeup IPI needs TIF_NR_POLLING and polling in idle support
that this series adds.

As Haris notes, the downside of using only WFE is that it allows just
a single idle state, and a fairly shallow one, because the event stream
causes a wakeup every 100us.
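
For reference, the waker-side decision behind avoiding the IPI can be
modelled in user space roughly as below. This is a minimal sketch using
C11 atomics, not the kernel implementation: FLAG_POLLING,
FLAG_NEED_RESCHED and all three helper functions are hypothetical
stand-ins (in mainline the polling flag is named TIF_POLLING_NRFLAG).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits, loosely modelled on the kernel's thread-info flags. */
#define FLAG_POLLING      (1u << 0)	/* idle CPU is polling its flags word */
#define FLAG_NEED_RESCHED (1u << 1)	/* a reschedule has been requested */

/*
 * Waker side: request a reschedule on a remote CPU.
 * Returns true if an IPI is still needed, false if setting the flag is
 * enough because the target is already polling the same word.
 */
static bool set_need_resched_check_ipi(atomic_uint *flags)
{
	unsigned int old = atomic_fetch_or(flags, FLAG_NEED_RESCHED);

	/* A polling target will notice the store by itself. */
	return !(old & FLAG_POLLING);
}

/* Idle side: advertise that this CPU is about to poll its flags word. */
static void idle_enter_polling(atomic_uint *flags)
{
	atomic_fetch_or(flags, FLAG_POLLING);
}

/*
 * Idle side: stop polling before leaving idle.
 * Returns true if a wakeup request arrived while polling was advertised.
 */
static bool idle_exit_polling(atomic_uint *flags)
{
	unsigned int old = atomic_fetch_and(flags, ~FLAG_POLLING);

	return (old & FLAG_NEED_RESCHED) != 0;
}
```

With the polling bit set, set_need_resched_check_ipi() returns false and
the waker can skip the IPI; without it, the IPI is still required.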

--
ankur
Re: [PATCH v10 10/11] arm64: idle: export arch_cpu_idle()
Posted by Shuai Xue 8 months, 1 week ago

On 2025/4/12 04:57, Ankur Arora wrote:
> 
> Shuai Xue <xueshuai@linux.alibaba.com> writes:
> 
>> On 2025/2/19 05:33, Ankur Arora wrote:
>>> Needed for cpuidle-haltpoll.
>>> Acked-by: Will Deacon <will@kernel.org>
>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>> ---
>>>    arch/arm64/kernel/idle.c | 1 +
>>>    1 file changed, 1 insertion(+)
>>> diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
>>> index 05cfb347ec26..b85ba0df9b02 100644
>>> --- a/arch/arm64/kernel/idle.c
>>> +++ b/arch/arm64/kernel/idle.c
>>> @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
>>>    	 */
>>>    	cpu_do_idle();
>>
>> Hi, Ankur,
>>
>> With haltpoll_driver registered, arch_cpu_idle() on x86 can select
>> mwait_idle() in idle threads.
>>
>> MONITOR sets up an effective address range that is monitored
>> for write-to-memory activities; MWAIT places the processor in
>> an optimized state (this may vary between different implementations)
>> until a write to the monitored address range occurs.
> 
> MWAIT is more capable than WFE -- it allows selection of deeper idle
> state. IIRC C2/C3.
> 
>> Should arch_cpu_idle() on arm64 also use the LDXR/WFE
>> to avoid wakeup IPI like x86 monitor/mwait?
> 
> Avoiding the wakeup IPI needs TIF_NR_POLLING and polling in idle support
> that this series adds.
> 
> As Haris notes, the negative with only using WFE is that it only allows
> a single idle state, one that is fairly shallow because the event-stream
> causes a wakeup every 100us.
> 
> --
> ankur

Hi, Ankur and Haris

Got it, thanks for the explanation :)

Comparing sched-pipe performance on Rund with Yitian 710, *IPC improved 35%*:

w/o haltpoll
Performance counter stats for 'CPU(s) 0,1' (5 runs):

     32521.53 msec task-clock                #    2.000 CPUs utilized            ( +-  1.16% )
  38081402726      cycles                    #    1.171 GHz                      ( +-  1.70% )
  27324614561      instructions              #    0.72  insn per cycle           ( +-  0.12% )
          181      sched:sched_wake_idle_without_ipi #    0.006 K/sec

w/ haltpoll
Performance counter stats for 'CPU(s) 0,1' (5 runs):

      9477.15 msec task-clock                #    2.000 CPUs utilized            ( +-  0.89% )
  21486828269      cycles                    #    2.267 GHz                      ( +-  0.35% )
  23867109747      instructions              #    1.11  insn per cycle           ( +-  0.11% )
      1925207      sched:sched_wake_idle_without_ipi #    0.203 M/sec

Comparing sched-pipe performance on QEMU with Kunpeng 920, *IPC improved 10%*:

w/o haltpoll
Performance counter stats for 'CPU(s) 0,1' (5 runs):

          34,007.89 msec task-clock                       #    2.000 CPUs utilized               ( +-  8.86% )
      4,407,859,620      cycles                           #    0.130 GHz                         ( +- 84.92% )
      2,482,046,461      instructions                     #    0.56  insn per cycle              ( +- 88.27% )
                 16      sched:sched_wake_idle_without_ipi #    0.470 /sec                        ( +- 98.77% )

              17.00 +- 1.51 seconds time elapsed  ( +-  8.86% )

w/ haltpoll
Performance counter stats for 'CPU(s) 0,1' (5 runs):

          16,894.37 msec task-clock                       #    2.000 CPUs utilized               ( +-  3.80% )
      8,703,158,826      cycles                           #    0.515 GHz                         ( +- 31.31% )
      5,379,257,839      instructions                     #    0.62  insn per cycle              ( +- 30.03% )
            549,434      sched:sched_wake_idle_without_ipi #   32.522 K/sec                       ( +- 30.05% )

              8.447 +- 0.321 seconds time elapsed  ( +-  3.80% )
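
As a cross-check of the figures above, IPC and its relative change can
be recomputed from the raw counters. The sketch below is only an
illustration: the helper names ipc() and rel_change() and the constant
names are made up here, and the counter values are copied from the
Yitian 710 runs.

```c
#include <stdint.h>

/* Instructions per cycle from two raw perf counters (0.0 if no cycles). */
static double ipc(uint64_t instructions, uint64_t cycles)
{
	return cycles ? (double)instructions / (double)cycles : 0.0;
}

/* Relative change of 'after' with respect to the baseline 'before'. */
static double rel_change(double before, double after)
{
	return (after - before) / before;
}

/* Counter values copied from the Yitian 710 perf stat output. */
static const uint64_t YITIAN_BASE_INSN   = 27324614561ULL; /* w/o haltpoll */
static const uint64_t YITIAN_BASE_CYCLES = 38081402726ULL;
static const uint64_t YITIAN_POLL_INSN   = 23867109747ULL; /* w/ haltpoll */
static const uint64_t YITIAN_POLL_CYCLES = 21486828269ULL;
```

On these inputs ipc() reproduces perf's own "insn per cycle" column
(about 0.72 and 1.11). Measured against the baseline run that is a rise
of roughly 55%; the 35% figure corresponds to the same difference taken
relative to the haltpoll value, (1.11 - 0.72) / 1.11. The Kunpeng 920
runs show run-to-run spreads of up to +- 88%, so ratios derived from
them are much less reliable.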

Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>

Thanks.
Shuai
Re: [PATCH v10 10/11] arm64: idle: export arch_cpu_idle()
Posted by Ankur Arora 8 months, 1 week ago
Shuai Xue <xueshuai@linux.alibaba.com> writes:

> On 2025/4/12 04:57, Ankur Arora wrote:
>> Shuai Xue <xueshuai@linux.alibaba.com> writes:
>>
>>> On 2025/2/19 05:33, Ankur Arora wrote:
>>>> Needed for cpuidle-haltpoll.
>>>> Acked-by: Will Deacon <will@kernel.org>
>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>> ---
>>>>    arch/arm64/kernel/idle.c | 1 +
>>>>    1 file changed, 1 insertion(+)
>>>> diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
>>>> index 05cfb347ec26..b85ba0df9b02 100644
>>>> --- a/arch/arm64/kernel/idle.c
>>>> +++ b/arch/arm64/kernel/idle.c
>>>> @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
>>>>    	 */
>>>>    	cpu_do_idle();
>>>
>>> Hi, Ankur,
>>>
>>> With haltpoll_driver registered, arch_cpu_idle() on x86 can select
>>> mwait_idle() in idle threads.
>>>
>>> MONITOR sets up an effective address range that is monitored
>>> for write-to-memory activities; MWAIT places the processor in
>>> an optimized state (this may vary between different implementations)
>>> until a write to the monitored address range occurs.
>> MWAIT is more capable than WFE -- it allows selection of deeper idle
>> state. IIRC C2/C3.
>>
>>> Should arch_cpu_idle() on arm64 also use the LDXR/WFE
>>> to avoid wakeup IPI like x86 monitor/mwait?
>> Avoiding the wakeup IPI needs TIF_NR_POLLING and polling in idle support
>> that this series adds.
>> As Haris notes, the negative with only using WFE is that it only allows
>> a single idle state, one that is fairly shallow because the event-stream
>> causes a wakeup every 100us.
>> --
>> ankur
>
> Hi, Ankur and Haris
>
> Got it, thanks for the explanation :)
>
> Comparing sched-pipe performance on Rund with Yitian 710, *IPC improved 35%*:

Thanks for testing, Shuai. I wasn't expecting the IPC to improve by quite
that much :). The reduced instructions make sense since we don't have to
handle the IRQ anymore but we would spend some of the saved cycles
waiting in WFE instead.

I'm not familiar with the Yitian 710. Can you check if you are running
with WFE? That's the __smp_cond_load_relaxed_timewait() path vs the
__smp_cond_load_relaxed_spinwait() path in [0]. Same question for the
Kunpeng 920.

Also, I'm working on a new version of the series in [1]. Would you be
okay trying that out?

Thanks
Ankur

[0] https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/
[1] https://lore.kernel.org/lkml/20250203214911.898276-4-ankur.a.arora@oracle.com/

> w/o haltpoll
> Performance counter stats for 'CPU(s) 0,1' (5 runs):
>
>     32521.53 msec task-clock                #    2.000 CPUs utilized            ( +-  1.16% )
>  38081402726      cycles                    #    1.171 GHz                      ( +-  1.70% )
>  27324614561      instructions              #    0.72  insn per cycle           ( +-  0.12% )
>          181      sched:sched_wake_idle_without_ipi #    0.006 K/sec
>
> w/ haltpoll
> Performance counter stats for 'CPU(s) 0,1' (5 runs):
>
>      9477.15 msec task-clock                #    2.000 CPUs utilized            ( +-  0.89% )
>  21486828269      cycles                    #    2.267 GHz                      ( +-  0.35% )
>  23867109747      instructions              #    1.11  insn per cycle           ( +-  0.11% )
>      1925207      sched:sched_wake_idle_without_ipi #    0.203 M/sec
>
> Comparing sched-pipe performance on QEMU with Kunpeng 920, *IPC improved 10%*:
>
> w/o haltpoll
> Performance counter stats for 'CPU(s) 0,1' (5 runs):
>
>          34,007.89 msec task-clock                       #    2.000 CPUs utilized               ( +-  8.86% )
>      4,407,859,620      cycles                           #    0.130 GHz                         ( +- 84.92% )
>      2,482,046,461      instructions                     #    0.56  insn per cycle              ( +- 88.27% )
>                 16      sched:sched_wake_idle_without_ipi #    0.470 /sec                        ( +- 98.77% )
>
>              17.00 +- 1.51 seconds time elapsed  ( +-  8.86% )
>
> w/ haltpoll
> Performance counter stats for 'CPU(s) 0,1' (5 runs):
>
>          16,894.37 msec task-clock                       #    2.000 CPUs utilized               ( +-  3.80% )
>      8,703,158,826      cycles                           #    0.515 GHz                         ( +- 31.31% )
>      5,379,257,839      instructions                     #    0.62  insn per cycle              ( +- 30.03% )
>            549,434      sched:sched_wake_idle_without_ipi #   32.522 K/sec                       ( +- 30.05% )
>
>              8.447 +- 0.321 seconds time elapsed  ( +-  3.80% )
>
> Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
>
> Thanks.
> Shuai
Re: [PATCH v10 10/11] arm64: idle: export arch_cpu_idle()
Posted by Shuai Xue 8 months, 1 week ago

On 2025/4/14 11:46, Ankur Arora wrote:
> 
> Shuai Xue <xueshuai@linux.alibaba.com> writes:
> 
>> On 2025/4/12 04:57, Ankur Arora wrote:
>>> Shuai Xue <xueshuai@linux.alibaba.com> writes:
>>>
>>>> On 2025/2/19 05:33, Ankur Arora wrote:
>>>>> Needed for cpuidle-haltpoll.
>>>>> Acked-by: Will Deacon <will@kernel.org>
>>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>>> ---
>>>>>     arch/arm64/kernel/idle.c | 1 +
>>>>>     1 file changed, 1 insertion(+)
>>>>> diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
>>>>> index 05cfb347ec26..b85ba0df9b02 100644
>>>>> --- a/arch/arm64/kernel/idle.c
>>>>> +++ b/arch/arm64/kernel/idle.c
>>>>> @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
>>>>>     	 */
>>>>>     	cpu_do_idle();
>>>>
>>>> Hi, Ankur,
>>>>
>>>> With haltpoll_driver registered, arch_cpu_idle() on x86 can select
>>>> mwait_idle() in idle threads.
>>>>
>>>> MONITOR sets up an effective address range that is monitored
>>>> for write-to-memory activities; MWAIT places the processor in
>>>> an optimized state (this may vary between different implementations)
>>>> until a write to the monitored address range occurs.
>>> MWAIT is more capable than WFE -- it allows selection of deeper idle
>>> state. IIRC C2/C3.
>>>
>>>> Should arch_cpu_idle() on arm64 also use the LDXR/WFE
>>>> to avoid wakeup IPI like x86 monitor/mwait?
>>> Avoiding the wakeup IPI needs TIF_NR_POLLING and polling in idle support
>>> that this series adds.
>>> As Haris notes, the negative with only using WFE is that it only allows
>>> a single idle state, one that is fairly shallow because the event-stream
>>> causes a wakeup every 100us.
>>> --
>>> ankur
>>
>> Hi, Ankur and Haris
>>
>> Got it, thanks for the explanation :)
>>
>> Comparing sched-pipe performance on Rund with Yitian 710, *IPC improved 35%*:
> 
> Thanks for testing Shuai. I wasn't expecting the IPC to improve by quite
> that much :). The reduced instructions make sense since we don't have to
> handle the IRQ anymore but we would spend some of the saved cycles
> waiting in WFE instead.
> 
> I'm not familiar with the Yitian 710. Can you check if you are running
> with WFE? That's the __smp_cond_load_relaxed_timewait() path vs the
> __smp_cond_load_relaxed_spinwait() path in [0]. Same question for the
> Kunpeng 920.

Yes, it is running with __smp_cond_load_relaxed_timewait().

I used perf probe to check whether the event stream (needed for the
WFE-based path) is available in the guest:

perf probe 'arch_timer_evtstrm_available%return r=$retval'
perf record -e probe:arch_timer_evtstrm_available__return -aR sleep 1
perf script
swapper       0 [000]  1360.063049: probe:arch_timer_evtstrm_available__return: (ffff800080a5c640 <- ffff800080d42764) r=0x1

arch_timer_evtstrm_available() returns true, so
__smp_cond_load_relaxed_timewait() is used.
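
For readers unfamiliar with the idea, a time-limited conditional wait
can be sketched in user space as below. This is an analogy under stated
assumptions, not the code from the series: wait_for_bits(), now_ns() and
cpu_wait_hint() are hypothetical names, cpu_wait_hint() is an empty
stand-in for WFE, and timespec_get() reads the wall clock where the
kernel would use a monotonic one.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Current time in nanoseconds (wall clock, via C11 timespec_get()). */
static uint64_t now_ns(void)
{
	struct timespec ts;

	timespec_get(&ts, TIME_UTC);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Hypothetical stand-in for WFE: user space just returns immediately. */
static void cpu_wait_hint(void)
{
}

/*
 * Poll *word until any bit in mask is set or limit_ns has elapsed.
 * Returns true if the condition was observed, false on timeout.
 */
static bool wait_for_bits(atomic_uint *word, unsigned int mask,
			  uint64_t limit_ns)
{
	uint64_t deadline = now_ns() + limit_ns;

	for (;;) {
		/* Relaxed load: only the value of the word matters here. */
		if (atomic_load_explicit(word, memory_order_relaxed) & mask)
			return true;
		if (now_ns() >= deadline)
			return false;
		cpu_wait_hint();
	}
}
```

On arm64 the wait hint would be a WFE, which the event stream wakes
periodically so that the deadline check still runs even if nobody
writes to the word.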

> 
> Also, I'm working on a new version of the series in [1]. Would you be
> okay trying that out?

Sure. Please cc me when you send out a new version.

> 
> Thanks
> Ankur
> 
> [0] https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com/
> [1] https://lore.kernel.org/lkml/20250203214911.898276-4-ankur.a.arora@oracle.com/
> 

Thanks.
Shuai
Re: [PATCH v10 10/11] arm64: idle: export arch_cpu_idle()
Posted by Ankur Arora 8 months ago
Shuai Xue <xueshuai@linux.alibaba.com> writes:

> On 2025/4/14 11:46, Ankur Arora wrote:
>> Shuai Xue <xueshuai@linux.alibaba.com> writes:
>>
>>> On 2025/4/12 04:57, Ankur Arora wrote:
>>>> Shuai Xue <xueshuai@linux.alibaba.com> writes:
>>>>
>>>>> On 2025/2/19 05:33, Ankur Arora wrote:
>>>>>> Needed for cpuidle-haltpoll.
>>>>>> Acked-by: Will Deacon <will@kernel.org>
>>>>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>>>>> ---
>>>>>>     arch/arm64/kernel/idle.c | 1 +
>>>>>>     1 file changed, 1 insertion(+)
>>>>>> diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
>>>>>> index 05cfb347ec26..b85ba0df9b02 100644
>>>>>> --- a/arch/arm64/kernel/idle.c
>>>>>> +++ b/arch/arm64/kernel/idle.c
>>>>>> @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
>>>>>>     	 */
>>>>>>     	cpu_do_idle();
>>>>>
>>>>> Hi, Ankur,
>>>>>
>>>>> With haltpoll_driver registered, arch_cpu_idle() on x86 can select
>>>>> mwait_idle() in idle threads.
>>>>>
>>>>> MONITOR sets up an effective address range that is monitored
>>>>> for write-to-memory activities; MWAIT places the processor in
>>>>> an optimized state (this may vary between different implementations)
>>>>> until a write to the monitored address range occurs.
>>>> MWAIT is more capable than WFE -- it allows selection of deeper idle
>>>> state. IIRC C2/C3.
>>>>
>>>>> Should arch_cpu_idle() on arm64 also use the LDXR/WFE
>>>>> to avoid wakeup IPI like x86 monitor/mwait?
>>>> Avoiding the wakeup IPI needs TIF_NR_POLLING and polling in idle support
>>>> that this series adds.
>>>> As Haris notes, the negative with only using WFE is that it only allows
>>>> a single idle state, one that is fairly shallow because the event-stream
>>>> causes a wakeup every 100us.
>>>> --
>>>> ankur
>>>
>>> Hi, Ankur and Haris
>>>
>>> Got it, thanks for the explanation :)
>>>
>>> Comparing sched-pipe performance on Rund with Yitian 710, *IPC improved 35%*:
>> Thanks for testing Shuai. I wasn't expecting the IPC to improve by quite
>> that much :). The reduced instructions make sense since we don't have to
>> handle the IRQ anymore but we would spend some of the saved cycles
>> waiting in WFE instead.
>> I'm not familiar with the Yitian 710. Can you check if you are running
>> with WFE? That's the __smp_cond_load_relaxed_timewait() path vs the
>> __smp_cond_load_relaxed_spinwait() path in [0]. Same question for the
>> Kunpeng 920.
>
> Yes, it is running with __smp_cond_load_relaxed_timewait().
>
> I use perf-probe to check if WFE is available in Guest:
>
> perf probe 'arch_timer_evtstrm_available%return r=$retval'
> perf record -e probe:arch_timer_evtstrm_available__return -aR sleep 1
> perf script
> swapper       0 [000]  1360.063049: probe:arch_timer_evtstrm_available__return: (ffff800080a5c640 <- ffff800080d42764) r=0x1
>
> arch_timer_evtstrm_available returns true, so
> __smp_cond_load_relaxed_timewait() is used.

Great. Thanks for checking.

>> Also, I'm working on a new version of the series in [1]. Would you be
>> okay trying that out?
>
> Sure. Please cc me when you send out a new version.

Will do. Thanks!

--
ankur
Re: [PATCH v10 10/11] arm64: idle: export arch_cpu_idle()
Posted by Okanovic, Haris 8 months, 1 week ago
On Fri, 2025-04-11 at 11:32 +0800, Shuai Xue wrote:
> > On 2025/2/19 05:33, Ankur Arora wrote:
> > > > Needed for cpuidle-haltpoll.
> > > > 
> > > > Acked-by: Will Deacon <will@kernel.org>
> > > > Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> > > > ---
> > > >   arch/arm64/kernel/idle.c | 1 +
> > > >   1 file changed, 1 insertion(+)
> > > > 
> > > > diff --git a/arch/arm64/kernel/idle.c b/arch/arm64/kernel/idle.c
> > > > index 05cfb347ec26..b85ba0df9b02 100644
> > > > --- a/arch/arm64/kernel/idle.c
> > > > +++ b/arch/arm64/kernel/idle.c
> > > > @@ -43,3 +43,4 @@ void __cpuidle arch_cpu_idle(void)
> > > >        */
> > > >       cpu_do_idle();
> > 
> > Hi, Ankur,
> > 
> > With haltpoll_driver registered, arch_cpu_idle() on x86 can select
> > mwait_idle() in idle threads.
> > 
> > MONITOR sets up an effective address range that is monitored
> > for write-to-memory activities; MWAIT places the processor in
> > an optimized state (this may vary between different implementations)
> > until a write to the monitored address range occurs.
> > 
> > Should arch_cpu_idle() on arm64 also use the LDXR/WFE
> > to avoid wakeup IPI like x86 monitor/mwait?

WFE will wake from the event stream, which can have short sub-ms
periods on many systems. This may be something to consider once WFET
is more widely available.

> > 
> > Thanks.
> > Shuai
> > 
> > 

Regards,
Haris Okanovic
AWS Graviton Software