Hi all,
so my investigation into teo led to the following fixes and
improvements. Logically they are mostly independent, which is why this
cover letter is quite short; details are in the patches.

1/6:
As discussed, the utilization threshold is too high; while there are
benefits in certain workloads, there are quite a few regressions, too.
2/6:
Especially with the new util threshold, stopping the tick makes little
sense when utilized is detected, so don't.
3/6:
Particularly with WFI, even if it's the only state, stopping the tick
has benefits, so enable that in the early bail-out.
4/6:
Stopping the tick at 0 cost (if the idle state dictates it) is too
aggressive IMO, so add a 1ms constant cost.
XXX: This has the issue of now being counted as an idle_miss, so we
could consider adding this to the states, too, but the simple
implementation of this would have the downside that the cost is added
to deeper states even if the tick is already off.
5/6:
Remove the 'recent' intercept logic; see my findings in:
https://lore.kernel.org/lkml/0ce2d536-1125-4df8-9a5b-0d5e389cd8af@arm.com/
I haven't found a way to salvage this properly, so I removed it.
The regular intercept metric seems to decay fast enough to make this
unnecessary, but we can revisit that if it turns out otherwise.
6/6:
The rest of the intercept logic had issues, too. See the commit.

TODO: add some measurements of common workloads and some simple sanity
tests (like Vincent described: checking in low-utilization workloads
whether the state selection looks reasonable).
I have some, but more (and more standardized) ones would be beneficial.

Happy for anyone to take a look and test as well.

Some numbers for context (I'll probably add them to the cover letter):

Comparing:
- IO workload (intercept-heavy).
- Timer workload, very low utilization (check for deepest state).
- hackbench (high utilization).
all on RK3399 with CONFIG_HZ=100.
target_residencies: 1, 900, 2000

1. IO workload, 5 runs, results sorted, in read IOPS.
fio --minimal --time_based --name=fiotest --filename=/dev/nvme0n1 --runtime=30 --rw=randread --bs=4k --ioengine=psync --iodepth=1 --direct=1 | cut -d \; -f 8;

teo fixed:
/dev/nvme0n1
[4597, 4673, 4727, 4741, 4756]
/dev/mmcblk2
[5753, 5832, 5837, 5911, 5949]
/dev/mmcblk1
[2059, 2062, 2070, 2071, 2080]

teo mainline:
/dev/nvme0n1
[3793, 3825, 3846, 3865, 3964]
/dev/mmcblk2
[3831, 4110, 4154, 4203, 4228]
/dev/mmcblk1
[1559, 1564, 1596, 1611, 1618]

menu:
/dev/nvme0n1
[2571, 2630, 2804, 2813, 2917]
/dev/mmcblk2
[4181, 4260, 5062, 5260, 5329]
/dev/mmcblk1
[1567, 1581, 1585, 1603, 1769]

2. Timer workload (through IO for my convenience ;) )
Results in read IOPS, fio same as above.
echo "0 2097152 zero" | dmsetup create dm-zeros
echo "0 2097152 delay /dev/mapper/dm-zeros 0 50" | dmsetup create dm-slow
(Each IO is delayed by a 50ms timer, so we should be mostly in state2.)

teo fixed:
3269 cpu_idle total
48 cpu_idle_miss
30 cpu_idle_miss above
18 cpu_idle_miss below

teo mainline:
3221 cpu_idle total
1269 cpu_idle_miss
22 cpu_idle_miss above
1247 cpu_idle_miss below

menu:
3433 cpu_idle total
114 cpu_idle_miss
61 cpu_idle_miss above
53 cpu_idle_miss below

Residencies: teo mainline isn't in state2 at all; teo fixed is more in
state2 than menu, but both are >95%.

tldr: overall, teo fixed spends more time in state2 while having fewer
idle_miss than menu. teo mainline was just way too aggressive at
selecting shallow states.
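For reference, below is a minimal sketch of one way counts like the
cpu_idle / cpu_idle_miss numbers above can be collected (not
necessarily how these particular numbers were produced). It assumes
trace-cmd is available and the kernel exposes the power:cpu_idle_miss
trace event; the exact report format, and hence the grep patterns, may
differ between kernel versions, and the file name idle.dat is
arbitrary:

    # Record idle events while the fio job runs against dm-slow.
    trace-cmd record -e power:cpu_idle -e power:cpu_idle_miss -o idle.dat \
        fio --time_based --name=fiotest --filename=/dev/mapper/dm-slow \
            --runtime=30 --rw=randread --bs=4k --ioengine=psync \
            --iodepth=1 --direct=1
    # Total cpu_idle events (entries and exits).
    trace-cmd report -i idle.dat | grep -c ' cpu_idle:'
    # Misses by direction: "above" = chosen state too deep,
    # "below" = too shallow (naming per the cpuidle statistics).
    trace-cmd report -i idle.dat | grep cpu_idle_miss | grep -c above
    trace-cmd report -i idle.dat | grep cpu_idle_miss | grep -c below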
3. Hackbench, 5 runs
for i in $(seq 0 4); do hackbench -l 100 -g 100; sleep 1; done

teo fixed:
Time: 4.807
Time: 4.856
Time: 5.072
Time: 4.934
Time: 4.962

teo mainline:
Time: 4.945
Time: 5.021
Time: 4.927
Time: 4.923
Time: 5.137

menu:
Time: 4.991
Time: 4.884
Time: 4.880
Time: 4.946
Time: 4.980

tldr: all comparable, teo mainline slightly worse.

Kind Regards,
Christian

Christian Loehle (6):
  cpuidle: teo: Increase util-threshold
  cpuidle: teo: Don't stop tick on utilized
  cpuidle: teo: Don't always stop tick on one state
  cpuidle: teo: Increase minimum time to stop tick
  cpuidle: teo: Remove recent intercepts metric
  cpuidle: teo: Don't count non-existent intercepts

 drivers/cpuidle/governors/teo.c | 121 +++++++++++++-------------------
 1 file changed, 48 insertions(+), 73 deletions(-)

--
2.34.1
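Regarding the sanity-test TODO in the cover letter above: one
low-effort check, assuming nothing beyond the standard cpuidle sysfs
interface, is to snapshot the per-state selection counters around a
low-utilization workload and eyeball whether the distribution looks
reasonable. A hedged sketch:

    # Per-state selection count and cumulative residency for cpu0;
    # run before and after the workload and compare the deltas.
    for s in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
        printf '%s: usage=%s time_us=%s\n' \
            "$(cat "$s/name")" "$(cat "$s/usage")" "$(cat "$s/time")"
    done

For the 50ms-timer workload above, nearly all of the delta would be
expected to land in the deepest state's usage/time.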
On 6/6/24 10:00, Christian Loehle wrote:
> Hi all,
> so my investigation into teo led to the following fixes and
> improvements.
[snip]
> 2. Timer workload (through IO for my convenience ;) )
> Results in read IOPS, fio same as above.
> echo "0 2097152 zero" | dmsetup create dm-zeros > echo "0 2097152 delay /dev/mapper/dm-zeros 0 50" | dmsetup create dm-slow > (Each IO is delayed by timer of 50ms, should be mostly in state2) > > teo fixed: > 3269 cpu_idle total > 48 cpu_idle_miss > 30 cpu_idle_miss above > 18 cpu_idle_miss below > > teo mainline: > 3221 cpu_idle total > 1269 cpu_idle_miss > 22 cpu_idle_miss above > 1247 cpu_idle_miss below > > menu: > 3433 cpu_idle total > 114 cpu_idle_miss > 61 cpu_idle_miss above > 53 cpu_idle_miss below > > Residencies: Hmm, maybe actually including them would've been helpful too: (Over 5s workload, only showing LITTLE cluster) teo fixed: idle_state 2.0 4.813378 -1.0 0.210820 1.0 0.202778 0.0 0.062426 teo mainline: idle_state 1.0 4.895766 -1.0 0.098063 0.0 0.253069 menu: idle_state 2.0 4.528356 -1.0 0.241486 1.0 0.345829 0.0 0.202505 > > tldr: overall teo fixed spends more time in state2 while having > fewer idle_miss than menu. > teo mainline was just way too aggressive at selecting shallow states. > > 3. Hackbench, 5 runs > for i in $(seq 0 4); do hackbench -l 100 -g 100 ; sleep 1; done > > teo fixed: > Time: 4.807 > Time: 4.856 > Time: 5.072 > Time: 4.934 > Time: 4.962 > > teo mainline: > Time: 4.945 > Time: 5.021 > Time: 4.927 > Time: 4.923 > Time: 5.137 > > menu: > Time: 4.991 > Time: 4.884 > Time: 4.880 > Time: 4.946 > Time: 4.980 > > tldr: all comparable, teo mainline slightly worse > > Kind Regards, > Christian > > Christian Loehle (6): > cpuidle: teo: Increase util-threshold > cpuidle: teo: Don't stop tick on utilized > cpuidle: teo: Don't always stop tick on one state > cpuidle: teo: Increase minimum time to stop tick > cpuidle: teo: Remove recent intercepts metric > cpuidle: teo: Don't count non-existent intercepts > > drivers/cpuidle/governors/teo.c | 121 +++++++++++++------------------- > 1 file changed, 48 insertions(+), 73 deletions(-) > > -- > 2.34.1 >
On Thu, Jun 6, 2024 at 1:59 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 6/6/24 10:00, Christian Loehle wrote:
> > [snip]
>
> > echo "0 2097152 zero" | dmsetup create dm-zeros > > echo "0 2097152 delay /dev/mapper/dm-zeros 0 50" | dmsetup create dm-slow > > (Each IO is delayed by timer of 50ms, should be mostly in state2) > > > > teo fixed: > > 3269 cpu_idle total > > 48 cpu_idle_miss > > 30 cpu_idle_miss above > > 18 cpu_idle_miss below > > > > teo mainline: > > 3221 cpu_idle total > > 1269 cpu_idle_miss > > 22 cpu_idle_miss above > > 1247 cpu_idle_miss below > > > > menu: > > 3433 cpu_idle total > > 114 cpu_idle_miss > > 61 cpu_idle_miss above > > 53 cpu_idle_miss below > > > > Residencies: > > Hmm, maybe actually including them would've been helpful too: > (Over 5s workload, only showing LITTLE cluster) > teo fixed: > idle_state > 2.0 4.813378 > -1.0 0.210820 Just to clarify, what does -1.0 mean here? > 1.0 0.202778 > 0.0 0.062426 > > teo mainline: > idle_state > 1.0 4.895766 > -1.0 0.098063 > 0.0 0.253069 > > menu: > idle_state > 2.0 4.528356 > -1.0 0.241486 > 1.0 0.345829 > 0.0 0.202505 > > > > > tldr: overall teo fixed spends more time in state2 while having > > fewer idle_miss than menu. > > teo mainline was just way too aggressive at selecting shallow states. > > > > 3. Hackbench, 5 runs > > for i in $(seq 0 4); do hackbench -l 100 -g 100 ; sleep 1; done > > > > teo fixed: > > Time: 4.807 > > Time: 4.856 > > Time: 5.072 > > Time: 4.934 > > Time: 4.962 > > > > teo mainline: > > Time: 4.945 > > Time: 5.021 > > Time: 4.927 > > Time: 4.923 > > Time: 5.137 > > > > menu: > > Time: 4.991 > > Time: 4.884 > > Time: 4.880 > > Time: 4.946 > > Time: 4.980 > > > > tldr: all comparable, teo mainline slightly worse > > > > Kind Regards, > > Christian > > > > Christian Loehle (6): > > cpuidle: teo: Increase util-threshold > > cpuidle: teo: Don't stop tick on utilized > > cpuidle: teo: Don't always stop tick on one state > > cpuidle: teo: Increase minimum time to stop tick > > cpuidle: teo: Remove recent intercepts metric > > cpuidle: teo: Don't count non-existent intercepts > > > > drivers/cpuidle/governors/teo.c | 121 +++++++++++++------------------- > > 1 file changed, 48 insertions(+), 73 deletions(-) > > > > -- > > 2.34.1 > > > >
On 6/6/24 13:29, Rafael J. Wysocki wrote:
> On Thu, Jun 6, 2024 at 1:59 PM Christian Loehle
> <christian.loehle@arm.com> wrote:
>>
>> On 6/6/24 10:00, Christian Loehle wrote:
>>> [snip]
>>> echo "0 2097152 zero" | dmsetup create dm-zeros >>> echo "0 2097152 delay /dev/mapper/dm-zeros 0 50" | dmsetup create dm-slow >>> (Each IO is delayed by timer of 50ms, should be mostly in state2) >>> >>> teo fixed: >>> 3269 cpu_idle total >>> 48 cpu_idle_miss >>> 30 cpu_idle_miss above >>> 18 cpu_idle_miss below >>> >>> teo mainline: >>> 3221 cpu_idle total >>> 1269 cpu_idle_miss >>> 22 cpu_idle_miss above >>> 1247 cpu_idle_miss below >>> >>> menu: >>> 3433 cpu_idle total >>> 114 cpu_idle_miss >>> 61 cpu_idle_miss above >>> 53 cpu_idle_miss below >>> >>> Residencies: >> >> Hmm, maybe actually including them would've been helpful too: >> (Over 5s workload, only showing LITTLE cluster) >> teo fixed: >> idle_state >> 2.0 4.813378 >> -1.0 0.210820 > > Just to clarify, what does -1.0 mean here? Good point, I should've mentioned. This adds up the residencies just from the trace event cpu_idle. -1 stands for PWR_EVENT_EXIT i.e. non-idle, it's quite useful if we're talking absolute numbers but idle state ratios. tldr: the time the CPU isn't in any idle state. > [snip]