Hi Christian,
It took a while.
On 2024.06.20 04:19 Christian Loehle wrote:
> On Tue, Jun 18, 2024 at 10:24:46AM -0700, Doug Smythies wrote:
>> Hi Christian,
>>
>> Thank you for your reply.
>
> Thank you for taking the time!
>
>>
>> On 2024.06.18 03:54 Christian Loehle wrote:
>>> On Sun, Jun 16, 2024 at 05:20:43PM -0700, Doug Smythies wrote:
>>>> On 2024.06.11 04:24 Christian Loehle wrote:
>>>>
>>>> ...
>>>> > Happy for anyone to take a look and test as well.
>>>> ...
>>>>
>>>> I tested the patch set.
>>>> I do a set of tests adopted over some years now.
>>>> Readers may recall that some of the tests search over a wide range of operating conditions looking for areas to focus on in more detail.
>>>> One interesting observation is that everything seems to run much slower than the last time I did this, last August, Kernel 6.5-rc4.
>>>>
>>>
>>> Thank you very much Doug, that is helpful indeed!
>>>
>>>> Test system:
>>>> Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz (6 cores, 2 thread per core, 12 CPUs)
>>>> CPU Frequency scaling driver: intel_pstate
>>>> HWP (HardWare Pstate) control: Disabled
>>>> CPU frequency scaling governor: Performance
>>>> Idle states: 4: name : description:
>>>> state0/name:POLL desc:CPUIDLE CORE POLL IDLE
>>>> state1/name:C1_ACPI desc:ACPI FFH MWAIT 0x0
>>>> state2/name:C2_ACPI desc:ACPI FFH MWAIT 0x30
>>>> state3/name:C3_ACPI desc:ACPI FFH MWAIT 0x60
>>>
>>> What are target residencies and exit latencies?
>>
>> Of course. Here:
>>
>> /sys/devices/system/cpu/cpu1/cpuidle/state0/residency:0
>> /sys/devices/system/cpu/cpu1/cpuidle/state1/residency:1
>> /sys/devices/system/cpu/cpu1/cpuidle/state2/residency:360
>> /sys/devices/system/cpu/cpu1/cpuidle/state3/residency:3102
>>
>> /sys/devices/system/cpu/cpu1/cpuidle/state0/latency:0
>> /sys/devices/system/cpu/cpu1/cpuidle/state1/latency:1
>> /sys/devices/system/cpu/cpu1/cpuidle/state2/latency:120
>> /sys/devices/system/cpu/cpu1/cpuidle/state3/latency:1034
>
> Thanks,
> what am I missing here that these are two different sets of states?
I don't know what you are missing. Those are not two different sets of states.
Maybe I am missing something?
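For anyone wanting to pull the same numbers from their own system, something like this works (a sketch only, assuming the standard cpuidle sysfs layout; cpu1 is arbitrary, the state table is the same on every CPU here):

#!/usr/bin/env python3
# Minimal sketch: print name, target residency (us) and exit latency (us)
# for each cpuidle state of one CPU from the standard sysfs attributes.
import glob
import os

def read(state_dir, attr):
    with open(os.path.join(state_dir, attr)) as f:
        return f.read().strip()

base = "/sys/devices/system/cpu/cpu1/cpuidle"
for state in sorted(glob.glob(os.path.join(base, "state*"))):
    print(f"{os.path.basename(state)}: name={read(state, 'name')} "
          f"residency={read(state, 'residency')}us latency={read(state, 'latency')}us")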
>>>> Idle driver: intel_idle
>>>> Idle governor: as per individual test
>>>> Kernel: 6.10-rc2 and with V1 and V2 patch sets (1000 Hz tick rate)
>>>> Legend:
>>>> teo: unmodified 6.10-rc2
>>>> menu:
>>>> ladder:
>>>> cl: Kernel 6.10-rc2 + Christian Loehle patch set V1
>>>> clv2: Kernel 6.10-rc2 + Christian Loehle patch set V2
no-util: Kernel 6.10-rc2 + Christian Loehle [PATCHv2 1/3] Revert: "cpuidle: teo: Introduce util-awareness"
>>>> System is extremely idle, other than the test work.
>>>
>>> If you don't mind spinning up another one, I'd be very curious about
>>> results from just the Util-awareness revert (i.e. v2 1/3).
>>> If not I'll try to reproduce your tests.
>>
>> I will, but not today.
Most, if not all, of the linked results have been replaced with versions that add "no-util" data.
Summary: there is negligible difference between "teo" and "no-util".
Isn't that what is expected for a system with 4 idle states?
Note 1: I forgot to change the date on several of the graphs.
> Thank you.
>
>> I have never been a fan of Util-awareness.
>
> Well if you want to elaborate on that I guess now is the time and
> here is the place. ;)
Most of my concerns with the original versions were fixed,
which is why it now has little to no effect on a system with 4 idle states.
Beyond that, I haven't had the time to review all of my old tests and findings.
>>>> Test 1: 2 core ping pong sweep:
>>>>
>>>> Pass a token between 2 CPUs on 2 different cores.
>>>> Do a variable amount of work at each stop.
>>>
>>> Hard to interpret the results here, as state residencies would be the
>>> most useful one, but from the results I assume that residencies are
>>> calculated over all possible CPUs, so 4/6 CPUs are pretty much idle
>>> the entire time, resulting in >75% state3 residency overall.
>>
>> It would be 10 of 12 CPUs are idle and 4 of 6 cores.
>
> Of course, my bad.
>
>> But fair enough, the residency stats are being dominated by the idle CPUs.
>> I usually look at the usage in conjunction with the residency percentages.
>> At 10 minutes (20 second sample period):
>> teo entered idle state 3 517 times ; clv2 was 1,979,541 times
>> At 20 minutes:
>> teo entered idle state 3 525 times ; clv2 was 3,011,263 times
>> Anyway, I could hack something to just use data from the 2 CPUs involved.
>
> Your method works, just a bit awkward, I guess I'm spoiled in that
> regard :)
> (Shameless plug:
> https://tooling.sites.arm.com/lisa/latest/trace_analysis.html#lisa.analysis.idle.IdleAnalysis.plot_cpu_idle_state_residency
> )
Very interesting. If I ever get more time, I'll try it.
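For what it is worth, the hack I had in mind is roughly the following (a sketch only; the CPU numbers are placeholders for the 2 CPUs the token actually visits, and it only uses the cumulative per-CPU "usage" and "time" counters from cpuidle sysfs):

#!/usr/bin/env python3
# Rough sketch: sample the cumulative cpuidle counters for a few CPUs,
# wait one sample period, then print per-CPU deltas so the mostly-idle
# CPUs do not dominate the statistics. 'usage' is the entry count and
# 'time' is the cumulative residency in microseconds.
import time

CPUS = [3, 9]       # placeholders: the 2 CPUs the ping-pong token visits
STATES = range(4)   # state0..state3 on this system
PERIOD = 20         # seconds, same as the test sample period

def snap():
    data = {}
    for cpu in CPUS:
        for st in STATES:
            base = f"/sys/devices/system/cpu/cpu{cpu}/cpuidle/state{st}"
            with open(f"{base}/usage") as f:
                entries = int(f.read())
            with open(f"{base}/time") as f:
                resid_us = int(f.read())
            data[(cpu, st)] = (entries, resid_us)
    return data

before = snap()
time.sleep(PERIOD)
after = snap()

for (cpu, st), (e0, t0) in before.items():
    e1, t1 = after[(cpu, st)]
    pct = 100.0 * (t1 - t0) / (PERIOD * 1e6)
    print(f"cpu{cpu} state{st}: entries {e1 - e0:>9}  residency {pct:6.2f}%")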
>>>> Purpose: To utilize the shallowest idle states
>>>> and observe the transition from using more of 1
>>>> idle state to another.
>>>>
>>>> Results relative to teo (negative is better):
menu ladder clv2 cl no-util
ave -2.09% 11.11% 2.88% 1.81% 0.32%
max 10.63% 33.83% 9.45% 10.13% 8.00%
min -11.58% 6.25% -3.61% -3.34% -1.06%
Note 1: Old data re-stated with all the ">>>" stuff removed.
Note 2: The max +8.00% for no-util is misleading, as it was just a slight difference in a transition point.
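For readers that have not seen the ping-pong tests before, the basic idea is roughly this (a sketch only, not the actual test program; the CPU numbers and loop counts are placeholders):

#!/usr/bin/env python3
# Sketch of the token-passing idea only, not the real test program:
# two processes pinned to CPUs on different cores pass a 1-byte token
# back and forth through a pair of pipes, doing a configurable amount
# of spin work at each stop. Sweeping WORK_LOOPS is what moves the
# governors from one idle state to another.
import os
import time

CPU_A, CPU_B = 3, 9   # placeholder CPU numbers on different cores
WORK_LOOPS = 1000     # the "variable amount of work at each stop"
PASSES = 100000

def spin(n):
    x = 0
    for _ in range(n):
        x += 1
    return x

a2b_r, a2b_w = os.pipe()
b2a_r, b2a_w = os.pipe()

pid = os.fork()
if pid == 0:                          # child: CPU_B side
    os.sched_setaffinity(0, {CPU_B})
    for _ in range(PASSES):
        os.read(a2b_r, 1)             # wait for the token
        spin(WORK_LOOPS)              # do some work
        os.write(b2a_w, b"t")         # pass the token back
    os._exit(0)

os.sched_setaffinity(0, {CPU_A})      # parent: CPU_A side
start = time.perf_counter()
for _ in range(PASSES):
    os.write(a2b_w, b"t")
    os.read(b2a_r, 1)
    spin(WORK_LOOPS)
elapsed = time.perf_counter() - start
os.waitpid(pid, 0)
print(f"{elapsed / PASSES * 1e6:.2f} us per loop")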
>>>> While there are a few operating conditions where clv2 performs better than teo, overall it is worse.
>>>>
>>>> Further details:
>>>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/2-core-pp-relative.png
>>>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/2-core-pp-data.png
>>>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/2-1/perf/
>>>>
>>>> Test 2: 6 core ping pong sweep:
>>>>
>>>> Pass a token between 6 CPUs on 6 different cores.
>>>> Do a variable amount of work at each stop.
>>>>
>>>
>>> My first guess would've been that this is the perfect workload for the
>>> very low utilization threshold, but even teo has >40% state3 residency
>>> consistently here.
>>
>> There are still 6 idle CPUs.
>> I'll try a 12 CPUs using each core twice type sweep test,
>> but I think I settled on 6 because it focused on what I wanted for results.
>
> I see, again, my bad.
I had a 12 CPU type test script already and have used it in the past. Anyway:
Results relative to teo (negative is better):
no-util menu clv2
ave 0.07% 0.77% 1.41%
max 0.85% 2.78% 11.45%
min -1.30% -0.62% 0.00%
Note 1: only test runs 1 to 120 are included, eliminating the bi-stable uncertainty region
of the higher test runs.
Note 2: This test does show differences between teo and no-util in idle state usage in
the bi-stable region. I do not know if it is repeatable.
Further details:
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/12-1/12-cpu-pp-data.png
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/12-1/12-cpu-pp-data-detail-a.png
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/12-1/12-cpu-pp-relative.png
http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/12-1/perf/
>>>> Purpose: To utilize the midrange idle states
>>>> and observe the transitions between use of
>>>> idle states.
>>>>
>>>> Note: This test has uncertainty in an area where the performance is bi-stable for all idle governors,
>>>> transitioning between much less power and slower performance and much more power and higher performance.
>>>> On either side of this area, the differences between all idle governors are negligible.
>>>> Only data from before this area (from results 1 to 95) was included in the below results.
>>>
>>> I see and agree with your interpretation. Difference in power between
>>> all tested seems to be negligible during that window. Interestingly
>>> the residencies of idle states seem to be very different, like ladder
>>> being mostly in deepest state3. Maybe total package power is too coarse
>>> to show the differences for this test.
>>>
>>>> Results relative to teo (negative is better):
menu ladder cl clv2 no-util
ave 0.16% 4.32% 2.54% 2.64% 0.25%
max 0.92% 14.32% 8.78% 8.50% 14.96%
min -0.44% 0.27% 0.09% 0.05% -0.54%
Note 1: Old data re-stated with all the ">>>" stuff removed.
Note 2: The max 14.96% for no-util was during the test start.
It is not always repeatable. See the dwell test results way further down below.
>>>> One large clv2 difference seems to be excessive use of the deepest idle state,
>>>> with corresponding 100% hit rate on the "Idle State 3 was to deep" metric.
>>>> Example (20 second sample time):
>>>>
>>>> teo: Idle state 3 entries: 600, 74.33% were too deep or 451. Processor power was 38.0 watts.
>>>> clv2: Idle state 3 entries: 4,375,243, 100.00% were too deep or 4,375,243. Processor power was 40.6 watts.
>>>> clv2 loop times were about 8% worse than teo.
>>>
>>> Some of the idle state 3 residencies seem to be >100% at the end here,
>>> not sure what's up with that.
>>
>> The test is over and the system is completely idle.
>> And yes, there are 4 calculations that come out > 100%, the worst being 100.71%,
>> with a total sum over all idle states of 100.79%.
>> I can look into it if you want but have never expected the numbers to be that accurate.
>
> Hopefully it's just some weird rounding thing, it just looks strange.
>
>>
>>>> Further details:
>>>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data-detail-a.png
>>>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data-detail-b.png
>>>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/6-core-pp-data.png
>>>> http://smythies.com/~doug/linux/idle/teo-util3/ping-sweep/6-1/perf/
>>>>
>>>> Test 3: sleeping ebizzy - 128 threads.
>>>>
>>>> Purpose: This test has given interesting results in the past.
>>>> The test varies the sleep interval between record lookups.
>>>> The result is varying usage of idle states.
>>>>
>>>> Results: relative to teo (negative is better):
menu clv2 ladder cl no-util
ave 0.06% 0.38% 0.81% 0.35% -0.03%
max 2.53% 3.20% 5.00% 2.87% 0.79%
min -2.13% -1.66% -3.30% -2.13% -1.19%
Note 1: Old data re-stated with all the ">>>" stuff removed.
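For context, the shape of the load in this test is roughly the following (a toy stand-in only, not ebizzy itself; the sleep interval is the parameter being swept):

#!/usr/bin/env python3
# Toy stand-in only, not ebizzy itself: each thread does a "record
# lookup" in a shared table and then sleeps for an interval that the
# test sweeps. Short intervals keep the CPUs in shallow idle states,
# longer intervals allow the deeper ones.
import random
import threading
import time

SLEEP_US = 500        # the swept sleep interval between lookups
THREADS = 128
RUN_SECONDS = 10
RECORDS = {i: i * i for i in range(100000)}
stop = threading.Event()

def worker():
    while not stop.is_set():
        _ = RECORDS[random.randrange(len(RECORDS))]   # the "record lookup"
        time.sleep(SLEEP_US / 1e6)                    # the swept sleep

threads = [threading.Thread(target=worker) for _ in range(THREADS)]
for t in threads:
    t.start()
time.sleep(RUN_SECONDS)
stop.set()
for t in threads:
    t.join()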
>>>> No strong conclusion here, from just the data.
>>>> However, clv2 seems to use a bit more processor power, on average.
>>>
>>> Not sure about that, from the residencies ladder and teo should be
>>> decisive losers in terms of power. While later in the test teo seems
>>> to be getting worse in power it doesn't quite reflect the difference
>>> in states.
>>> E.g. clv2 finishing with 65% state2 residency while teo has 40%, but
>>> I'll try to get per-CPU power measurements on this one.
>>> Interestingly ladder is a clear winner if anything, if that is reliable
>>> as a result that could indicate a too aggressive tick stop from the
>>> other governors, but cl isn't that much better than clv2 here, even
>>> though it stops the tick less aggressively.
>>
>> I agree with what you are saying.
>> It is a shorter test at only 25 minutes.
>> It might be worth trying the test again with more strict attention to
>> stabilizing the system thermally before each test.
>> The processor power will vary by a few watts for the exact same load
>> as a function of processor package temperature and coolant (my system is
>> water cooled) temperature and can take 20 to 30 minutes to settle.
>>
>> Reference:
>> http://smythies.com/~doug/linux/idle/teo-util3/temperature/thermal-stabalization-time.png
>>
>>>>
>>>> Further details:
>>>
>>> Link is missing, but I found
>>> http://smythies.com/~doug/linux/idle/teo-util3/ebizzy/
>>> from browsing your page.
>>
>> Yes, I accidentally hit "Send" on my original email before it was actually finished.
>> But, then I was tired and thought "close enough".
>>
>>>> Test4: adrestia wakeup latency tests. 500 threads.
>>>>
>>>> Purpose: The test was reported in 2023.09 by the kernel test robot and looked
>>>> both interesting and gave interesting results, so I added it to the tests I run.
>>>
>>> http://smythies.com/~doug/linux/idle/teo-util3/adrestia/periodic/perf/
>>> So interestingly we can see, what I would call, the misbehavior of teo
>>> here, with teo skipping state2 and state3 entirely. You would expect
>>> a power regression here, but it doesn't translate into package power
>>> anyway.
>>>
>>>>
>>>> Results:
teo:wakeup cost (periodic, 20us): 3130nSec reference
clv2:wakeup cost (periodic, 20us): 3179nSec +1.57%
cl:wakeup cost (periodic, 20us): 3206nSec +2.43%
menu:wakeup cost (periodic, 20us): 2933nSec -6.29%
ladder:wakeup cost (periodic, 20us): 3530nSec +12.78%
no-util: wakeup cost (periodic, 20us): 3062nSec -2.17%
The really informative graph is this one:
http://smythies.com/~doug/linux/idle/teo-util3/adrestia/periodic/histogram-detail-a.png
Further details:
http://smythies.com/~doug/linux/idle/teo-util3/adrestia/periodic/histogram-detail-b.png
http://smythies.com/~doug/linux/idle/teo-util3/adrestia/periodic/perf/
>>>
>>> Is this measured as wakeup latency?
>>> I can't find much info about the test setup here, do you mind sharing
>>> something on it?
>>
>> I admit to being vague on this one, and I'll need some time to review.
>> The notes I left for myself last September are here:
>> http://smythies.com/~doug/linux/idle/teo-util2/adrestia/README.txt
Those notes have been updated but are still not very good.
There is a bunch of system overhead in the "wakeup cost".
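My reading of the number is roughly the following measurement shape (a sketch only, not the adrestia source; the reported cost is the overshoot past the requested interval, which is why the system overhead gets included):

#!/usr/bin/env python3
# Sketch of my reading of a periodic wakeup-cost measurement, not the
# adrestia source: sleep for a fixed short interval and report how much
# later than requested the thread actually wakes up. That overshoot
# includes idle-state exit latency plus timer and scheduler overhead,
# which is the "system overhead" mentioned above.
import time

INTERVAL_NS = 20_000      # periodic, 20us
SAMPLES = 100_000

overshoot_total = 0
for _ in range(SAMPLES):
    t0 = time.perf_counter_ns()
    time.sleep(INTERVAL_NS / 1e9)
    t1 = time.perf_counter_ns()
    overshoot_total += (t1 - t0) - INTERVAL_NS

print(f"average wakeup cost: {overshoot_total / SAMPLES:.0f} ns")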
>
> Thanks!
>
>>
>>>> No strong conclusion here, from just the data.
>>>> However, clv2 seems to use a bit more processor power, on average.
>>>> teo: 69.72 watts
>>>> clv2: 72.91 watts +4.6%
>>>> Note 1: The first 5 minutes of the test powers were discarded to allow for thermal stabilization.
>>
>> which might not have been long enough, see the thermal notes above.
>>
>>>> Note 2: More work is required for this test, because the teo one actually took longer to execute, due to more outlier results than the other tests.
>>
>>>> There were several other tests run but are not written up herein.
>>>>
>>> Because results are on par for all? Or inconclusive / not reproducible?
>>
>> Yes, because nothing of significance was observed or the test was more or less a repeat of an already covered test.
>> Initially, I had a mistake in my baseline teo tests, and a couple of the not written up tests have still not been re-run with the
>> proper baseline.
>
> Thank you for testing, that's what I hoped.
>
> Kind Regards,
> Christian
Results from a 6 core ping pong dwell test:
Note: This is the same spot as the first data point from the above 6 core sweep test.
It is important to note that the no-util result was not about +15%, as it was above.
averages:
teo: 11.91786092 reference.
clv2: 12.94927586 +8.65%
cl: 12.89657797 +8.22%
menu: 11.85430331 -0.54%
ladder: 13.93532619 +17.08%
no-util: 11.93479453 +0.14%
Further details:
http://smythies.com/~doug/linux/idle/teo-util3/6-5000000-0/perf/
... Doug