[PATCHSET v6 0/4] Split iowait into two states
Posted by Jens Axboe 1 year, 5 months ago
Hi,

This is v6 of the patchset where the current in_iowait state is split
into two parts:

1) The "task is sleeping waiting on IO", and would like cpufreq goodness
   in terms of sleep and wakeup latencies.
2) The above, and also accounted as such in the iowait stats.

The current ->in_iowait covers both, this series splits it into two types
of state so that each can be controlled separately.

Patches 1..3 are prep patches, changing the type of
task_struct->nr_iowait and adding helpers to manipulate the iowait counts.

Patch 4 does the actual splitting.
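
To make the split concrete, here is a minimal userspace model of the idea.
The names below (in_iowait_acct, nr_iowait_acct, the helpers) are
illustrative assumptions for explanation, not necessarily the identifiers
the patches use:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Per-runqueue counters after the split: everyone who wants the boost
 * bumps nr_iowait; only accounted sleepers also bump nr_iowait_acct. */
struct rq_model {
	atomic_long nr_iowait;
	atomic_long nr_iowait_acct;
};

/* Per-task state: two bits instead of one. */
struct task_model {
	bool in_iowait;		/* "sleeping on IO, boost me" */
	bool in_iowait_acct;	/* "and show me in the iowait stats" */
};

static void iowait_sleep(struct rq_model *rq, struct task_model *t, bool acct)
{
	t->in_iowait = true;
	atomic_fetch_add(&rq->nr_iowait, 1);
	if (acct) {
		t->in_iowait_acct = true;
		atomic_fetch_add(&rq->nr_iowait_acct, 1);
	}
}

static void iowait_wake(struct rq_model *rq, struct task_model *t)
{
	atomic_fetch_sub(&rq->nr_iowait, 1);
	if (t->in_iowait_acct)
		atomic_fetch_sub(&rq->nr_iowait_acct, 1);
	t->in_iowait = t->in_iowait_acct = false;
}

int main(void)
{
	struct rq_model rq;
	struct task_model t = { false, false };

	atomic_init(&rq.nr_iowait, 0);
	atomic_init(&rq.nr_iowait_acct, 0);

	/* Boost-only sleep, e.g. an io_uring wait that should not show up
	 * as iowait in /proc/stat. */
	iowait_sleep(&rq, &t, false);
	printf("boosting=%ld accounted=%ld\n",
	       atomic_load(&rq.nr_iowait), atomic_load(&rq.nr_iowait_acct));
	iowait_wake(&rq, &t);
	return 0;
}

The only point of the model is that the boost hint and the accounted
statistic can now move independently.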

This has been sitting for a while, would be nice to get this queued up
for 6.12. Comments welcome!

 arch/s390/appldata/appldata_base.c |  2 +-
 arch/s390/appldata/appldata_os.c   |  2 +-
 block/blk-cgroup.c                 |  2 +-
 fs/proc/stat.c                     |  2 +-
 include/linux/sched.h              | 10 ++++-
 include/linux/sched/stat.h         |  5 ++-
 kernel/locking/mutex.c             |  4 +-
 kernel/locking/rtmutex_api.c       |  4 +-
 kernel/sched/core.c                | 68 ++++++++++++++++++++++++------
 kernel/sched/cputime.c             |  3 +-
 kernel/sched/sched.h               |  5 ++-
 kernel/time/tick-sched.c           |  6 +--
 12 files changed, 81 insertions(+), 32 deletions(-)

Since v5:
- Make nr_iowait atomic_long_t unconditionally, as 32-bit archs have
  it as a 32-bit type. This avoids the ifdef stuff in sched/core.c.
  Thanks to Zhang Qiao for that suggestion.

-- 
Jens Axboe
Re: [PATCHSET v6 0/4] Split iowait into two states
Posted by Pavel Begunkov 10 months, 1 week ago
On 8/19/24 16:39, Jens Axboe wrote:
> Hi,
> 
> This is v6 of the patchset where the current in_iowait state is split
> into two parts:
> 
> 1) The "task is sleeping waiting on IO", and would like cpufreq goodness
>     in terms of sleep and wakeup latencies.
> 2) The above, and also accounted as such in the iowait stats.
> 
> The current ->in_iowait covers both, this series splits it into two types
> of state so that each can be controlled seperately.
> 
> Patches 1..3 are prep patches, changing the type of
> task_struct->nr_iowait and adding helpers to manipulate the iowait counts.
> 
> Patch 4 does the actual splitting.
> 
> This has been sitting for a while, would be nice to get this queued up
> for 6.12. Comments welcome!

Good day,

Did anything good happen with these patches or related work?
Christian?

Reminder: the goal is to let io_uring keep using iowait boosting but
avoid reporting it in the iowait stats, because the jump in the stat
spooks users. I know of at least several users carrying out-of-tree
patches to work around it. And, apparently, disabling the boosting
causes perf regressions.

I'm reading through the thread, but unless I missed something, it looks
like the patchset is actually aligned with the future plans for iowait
mentioned in the thread, in the sense that it reduces the exposure to
user space, and, when it's time, a better approach will be able to
replace it with no visible effect on the user.

On the other hand, there seems to be a workaround io_uring patch
queued, which I quite dislike from the io_uring perspective but also
because it exposes even more of iowait to the user.
I can understand why it's there, it has been over a year since v1,
but maybe we can figure something out before it's released? Would
it be fine to have something similar to this series? Any other
ideas?

-- 
Pavel Begunkov
Re: [PATCHSET v6 0/4] Split iowait into two states
Posted by Christian Loehle 10 months, 1 week ago
On 3/31/25 10:02, Pavel Begunkov wrote:
> On 8/19/24 16:39, Jens Axboe wrote:
>> Hi,
>>
>> This is v6 of the patchset where the current in_iowait state is split
>> into two parts:
>>
>> 1) The "task is sleeping waiting on IO", and would like cpufreq goodness
>>     in terms of sleep and wakeup latencies.
>> 2) The above, and also accounted as such in the iowait stats.
>>
>> The current ->in_iowait covers both, this series splits it into two types
>> of state so that each can be controlled seperately.
>>
>> Patches 1..3 are prep patches, changing the type of
>> task_struct->nr_iowait and adding helpers to manipulate the iowait counts.
>>
>> Patch 4 does the actual splitting.
>>
>> This has been sitting for a while, would be nice to get this queued up
>> for 6.12. Comments welcome!
> 
> Good day,
> 
> Did anything good happened with these patches or related work?
> Christian> 

Hi Pavel,
so for the cpuidle part we've had commit 38f83090f515 ("cpuidle: menu:
Remove iowait influence") for a while now without many complaints;
hopefully that means it stays in.
So I'd really like to know how the results compare now for relevant
workloads.

cpufreq iowait boosting is still a thing in schedutil and intel_pstate,
and so far I've failed to convince Rafael and Peter to get rid of it.
I still think that is the right thing to do, but it does come with a
regression in most of the simple synthetic fio tests.

> Reminder: the goal is to let io_uring to keep using iowait boosting
> but avoid reporting it in the iowait stats, because the jump in the
> stat spooks users. I know at least several users carrying out of tree
> patches to work it around. And, apparently, disabling the boosting
> causes perf regressions.

Details would be appreciated. I looked at the postgres workload that
justified it initially, and that was about cpuidle iowait, which is no
longer a thing.

> 
> I'm reading through the thread, but unless I missed something, it looks
> like the patchset is actually aligned with future plans on iowait
> mentioned in the thread, in a sense that it reduces the exposure to
> the user space, and, when it's time, a better approach will be able
> replaces it with no visible effect to the user.

I'm not against $subject necessarily, it's clearly a hack papering
over this, but as I've mentioned I'm fine carrying a revert of $subject
for a future series on iowait boosting.

> 
> On the other hand, there seems to be a work around io_uring patch
> queued for, which I quite dislike from io_uring perspective but also
> because it exposes even more of iowait to the user.
> I can understand why it's there, it has been over a year since v1,
> but maybe we can figure something out before it's released? Would
> it be fine to have something similar to this series? Any other
> ideas?

Ah thank you, I'd missed this:
https://lore.kernel.org/io-uring/f548f142-d6f3-46d8-9c58-6cf595c968fb@kernel.dk/
Would be nice if this led to more numbers comparing the two, at least.

Re: [PATCHSET v6 0/4] Split iowait into two states
Posted by Pavel Begunkov 10 months, 1 week ago
On 3/31/25 11:33, Christian Loehle wrote:
> On 3/31/25 10:02, Pavel Begunkov wrote:
>> On 8/19/24 16:39, Jens Axboe wrote:
>>> Hi,
>>>
>>> This is v6 of the patchset where the current in_iowait state is split
>>> into two parts:
>>>
>>> 1) The "task is sleeping waiting on IO", and would like cpufreq goodness
>>>      in terms of sleep and wakeup latencies.
>>> 2) The above, and also accounted as such in the iowait stats.
>>>
>>> The current ->in_iowait covers both, this series splits it into two types
>>> of state so that each can be controlled seperately.
>>>
>>> Patches 1..3 are prep patches, changing the type of
>>> task_struct->nr_iowait and adding helpers to manipulate the iowait counts.
>>>
>>> Patch 4 does the actual splitting.
>>>
>>> This has been sitting for a while, would be nice to get this queued up
>>> for 6.12. Comments welcome!
>>
>> Good day,
>>
>> Did anything good happened with these patches or related work?
>> Christian>
> 
> Hi Pavel,
> so for cpuidle part we've had commit ("38f83090f515 cpuidle: menu: Remove iowait influence")
> for a while now without much complaints, hopefully that means it stays in.
> So I'd really like to know how the results still compare for relevant workloads.

Sounds great

> cpufreq iowait boosting is still a thing in schedutil and intel_pstate,
> and so far I've failed to convince Rafael and Peter to get rid of it.
> I still think that is the right thing to do, but it does come with a
> regression in most of the simple synthetic fio tests.

IOW, from the io_uring iowait stat problem perspective it got stuck
and is unlikely to move short term.

>> Reminder: the goal is to let io_uring to keep using iowait boosting
>> but avoid reporting it in the iowait stats, because the jump in the
>> stat spooks users. I know at least several users carrying out of tree
>> patches to work it around. And, apparently, disabling the boosting
>> causes perf regressions.
> 
> Details would be appreciated, I looked the the postgres workload that
> justified it initially and that was on cpuidle iowait which is no
> longer a thing.

I wasn't involved and I'm afraid I don't have any extra numbers.

>> I'm reading through the thread, but unless I missed something, it looks
>> like the patchset is actually aligned with future plans on iowait
>> mentioned in the thread, in a sense that it reduces the exposure to
>> the user space, and, when it's time, a better approach will be able
>> replaces it with no visible effect to the user.
> 
> I'm not against $subject necessarily, it's clearly a hack tapering
> over this but as I've mentioned I'm fine carrying a revert of $subject
> for a future series on iowait boosting.
> 
>>
>> On the other hand, there seems to be a work around io_uring patch
>> queued for, which I quite dislike from io_uring perspective but also
>> because it exposes even more of iowait to the user.
>> I can understand why it's there, it has been over a year since v1,
>> but maybe we can figure something out before it's released? Would
>> it be fine to have something similar to this series? Any other
>> ideas?
> 
> Ah thank you, I've missed this
> https://lore.kernel.org/io-uring/f548f142-d6f3-46d8-9c58-6cf595c968fb@kernel.dk/
> Would be nice if this lead to more numbers comparing the two at least.

Sure, but I'd rather avoid adding this type of uapi just to test
it and then solve the problem a different way later.

-- 
Pavel Begunkov

Re: [PATCHSET v6 0/4] Split iowait into two states
Posted by Peter Zijlstra 1 year, 5 months ago
On Mon, Aug 19, 2024 at 09:39:45AM -0600, Jens Axboe wrote:
> Hi,
> 
> This is v6 of the patchset where the current in_iowait state is split
> into two parts:
> 
> 1) The "task is sleeping waiting on IO", and would like cpufreq goodness
>    in terms of sleep and wakeup latencies.
> 2) The above, and also accounted as such in the iowait stats.
> 
> The current ->in_iowait covers both, this series splits it into two types
> of state so that each can be controlled seperately.

Yeah, but *WHY* !?!? I have some vague memories from last time around,
but patches should really keep this information.

> Patches 1..3 are prep patches, changing the type of
> task_struct->nr_iowait and adding helpers to manipulate the iowait counts.
> 
> Patch 4 does the actual splitting.
> 
> This has been sitting for a while, would be nice to get this queued up
> for 6.12. Comments welcome!

Ufff, and all this because menu-governor does something insane :-(

Rafael, why can't we simply remove this from menu? All the nr_iowait*()
users are basically broken and I would much rather fix broken than work
around broken like this.

That is, from where I'm sitting this all makes the io-wait situation far
worse instead of better.
Re: [PATCHSET v6 0/4] Split iowait into two states
Posted by Rafael J. Wysocki 1 year, 5 months ago
On Wed, Sep 4, 2024 at 4:28 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Aug 19, 2024 at 09:39:45AM -0600, Jens Axboe wrote:
> > Hi,
> >
> > This is v6 of the patchset where the current in_iowait state is split
> > into two parts:
> >
> > 1) The "task is sleeping waiting on IO", and would like cpufreq goodness
> >    in terms of sleep and wakeup latencies.
> > 2) The above, and also accounted as such in the iowait stats.
> >
> > The current ->in_iowait covers both, this series splits it into two types
> > of state so that each can be controlled seperately.
>
> Yeah, but *WHY* !?!? I have some vague memories from last time around,
> but patches should really keep this information.
>
> > Patches 1..3 are prep patches, changing the type of
> > task_struct->nr_iowait and adding helpers to manipulate the iowait counts.
> >
> > Patch 4 does the actual splitting.
> >
> > This has been sitting for a while, would be nice to get this queued up
> > for 6.12. Comments welcome!
>
> Ufff, and all this because menu-governor does something insane :-(
>
> Rafael, why can't we simply remove this from menu?

Same reason as before: people use it and refuse to stop.

But this is mostly about the schedutil cpufreq governor that uses
iowait boosting.
Re: [PATCHSET v6 0/4] Split iowait into two states
Posted by Rafael J. Wysocki 1 year, 5 months ago
On Wed, Sep 4, 2024 at 4:42 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Wed, Sep 4, 2024 at 4:28 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Mon, Aug 19, 2024 at 09:39:45AM -0600, Jens Axboe wrote:
> > > Hi,
> > >
> > > This is v6 of the patchset where the current in_iowait state is split
> > > into two parts:
> > >
> > > 1) The "task is sleeping waiting on IO", and would like cpufreq goodness
> > >    in terms of sleep and wakeup latencies.
> > > 2) The above, and also accounted as such in the iowait stats.
> > >
> > > The current ->in_iowait covers both, this series splits it into two types
> > > of state so that each can be controlled seperately.
> >
> > Yeah, but *WHY* !?!? I have some vague memories from last time around,
> > but patches should really keep this information.
> >
> > > Patches 1..3 are prep patches, changing the type of
> > > task_struct->nr_iowait and adding helpers to manipulate the iowait counts.
> > >
> > > Patch 4 does the actual splitting.
> > >
> > > This has been sitting for a while, would be nice to get this queued up
> > > for 6.12. Comments welcome!
> >
> > Ufff, and all this because menu-governor does something insane :-(
> >
> > Rafael, why can't we simply remove this from menu?
>
> Same reason as before: people use it and refuse to stop.
>
> But this is mostly about the schedutil cpufreq governor that uses
> iowait boosting.

To be more precise, there are two different uses of "iowait" in PM.

One is the nr_iowait_cpu() call in menu_select() and the result of it
is used for two purposes: (1) select different sets of statistics
depending on whether or not this number is zero and (2) set a limit
for the idle state's exit latency that depends on this number (but
note that it only takes effect when the "iowait" statistics are used
in the first place).  Both of these are arguably questionable and it
is unclear to me whether or not they actually help and how much.

The other use is boosting CPU frequency in schedutil and intel_pstate
if SCHED_CPUFREQ_IOWAIT is passed to them which in turn depends on the
p->in_iowait value in enqueue_task_fair().

AFAICS, the latter makes a major difference.
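
For orientation, a deliberately simplified standalone model of those two
decision points follows. SCHED_CPUFREQ_IOWAIT, nr_iowait_cpu(),
menu_select() and enqueue_task_fair() are real names; everything else in
the sketch is illustrative and not the actual kernel code:

#include <stdbool.h>
#include <stdio.h>

/* (1) cpuidle/menu: nr_iowait_cpu() picks which set of statistics is used
 *     and, when nonzero, caps the acceptable exit latency (shallower idle). */
static int menu_like_pick(long nr_iowaiters, int deepest_state,
			  const int *exit_latency_us, int iowait_latency_cap_us)
{
	int state = deepest_state;

	if (nr_iowaiters)
		while (state > 0 && exit_latency_us[state] > iowait_latency_cap_us)
			state--;
	return state;
}

/* (2) cpufreq: enqueue of a task that slept with ->in_iowait set raises an
 *     SCHED_CPUFREQ_IOWAIT-style hint, which schedutil/intel_pstate turn
 *     into a frequency boost. */
static bool wants_freq_boost(bool task_was_in_iowait)
{
	return task_was_in_iowait;
}

int main(void)
{
	const int exit_latency_us[] = { 1, 10, 100, 1000 };	/* per idle state */

	printf("state with iowaiters: C%d\n",
	       menu_like_pick(2, 3, exit_latency_us, 50));
	printf("state without iowaiters: C%d\n",
	       menu_like_pick(0, 3, exit_latency_us, 50));
	printf("boost on enqueue after iowait sleep: %d\n", wants_freq_boost(true));
	return 0;
}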
Re: [PATCHSET v6 0/4] Split iowait into two states
Posted by Peter Zijlstra 1 year, 5 months ago
On Wed, Sep 04, 2024 at 05:18:57PM +0200, Rafael J. Wysocki wrote:

> To be more precise, there are two different uses of "iowait" in PM.
> 
> One is the nr_iowait_cpu() call in menu_select() and the result of it
> is used for two purposes: (1) select different sets of statistics
> depending on whether or not this number is zero and (2) set a limit
> for the idle state's exit latency that depends on this number (but
> note that it only takes effect when the "iowait" statistics are used
> in the first place).  Both of these are arguably questionable and it
> is unclear to me whether or not they actually help and how much.

So this one is very dubious, it relies on tasks getting back on the CPU
they went to sleep on -- not guaranteed at all.

> The other use is boosting CPU frequency in schedutil and intel_pstate
> if SCHED_CPUFREQ_IOWAIT is passed to them which in turn depends on the
> p->in_iowait value in enqueue_task_fair().

This one is fine and makes sense. At this point we know that p is going
to run and where it is going to run.
Re: [PATCHSET v6 0/4] Split iowait into two states
Posted by Christian Loehle 1 year, 5 months ago
On 9/5/24 10:36, Peter Zijlstra wrote:
> On Wed, Sep 04, 2024 at 05:18:57PM +0200, Rafael J. Wysocki wrote:
> 
>> To be more precise, there are two different uses of "iowait" in PM.
>>
>> One is the nr_iowait_cpu() call in menu_select() and the result of it
>> is used for two purposes: (1) select different sets of statistics
>> depending on whether or not this number is zero and (2) set a limit
>> for the idle state's exit latency that depends on this number (but
>> note that it only takes effect when the "iowait" statistics are used
>> in the first place).  Both of these are arguably questionable and it
>> is unclear to me whether or not they actually help and how much.
> 
> So this one is very dubious, it relies on tasks getting back on the CPU
> they went to sleep on -- not guaranteed at all.
> 
>> The other use is boosting CPU frequency in schedutil and intel_pstate
>> if SCHED_CPUFREQ_IOWAIT is passed to them which in turn depends on the
>> p->in_iowait value in enqueue_task_fair().
> 
> This one is fine and makes sense. At this point we know that p is going
> to run and where it is going to run.

In any even remotely realistic scenario and hardware, though, the boost
isn't effective until the next enqueue-dequeue cycle, so if your
objection above is based on that, I would object here too, using your
argument.
Re: [PATCHSET v6 0/4] Split iowait into two states
Posted by Peter Zijlstra 1 year, 5 months ago
On Thu, Sep 05, 2024 at 11:31:09AM +0100, Christian Loehle wrote:
> On 9/5/24 10:36, Peter Zijlstra wrote:
> > On Wed, Sep 04, 2024 at 05:18:57PM +0200, Rafael J. Wysocki wrote:
> > 
> >> To be more precise, there are two different uses of "iowait" in PM.
> >>
> >> One is the nr_iowait_cpu() call in menu_select() and the result of it
> >> is used for two purposes: (1) select different sets of statistics
> >> depending on whether or not this number is zero and (2) set a limit
> >> for the idle state's exit latency that depends on this number (but
> >> note that it only takes effect when the "iowait" statistics are used
> >> in the first place).  Both of these are arguably questionable and it
> >> is unclear to me whether or not they actually help and how much.
> > 
> > So this one is very dubious, it relies on tasks getting back on the CPU
> > they went to sleep on -- not guaranteed at all.
> > 
> >> The other use is boosting CPU frequency in schedutil and intel_pstate
> >> if SCHED_CPUFREQ_IOWAIT is passed to them which in turn depends on the
> >> p->in_iowait value in enqueue_task_fair().
> > 
> > This one is fine and makes sense. At this point we know that p is going
> > to run and where it is going to run.
> 
> On any even remotely realistic scenario and hardware though the boost
> isn't effective until the next enqueue-dequeue-cycle, so if your above
> objection is based on that, I would object here too, using your argument.

That is a quality-of-implementation issue with schedutil, no?

The whole notion that the wait was for feeding external hardware, and
that thus the normal utilization metric doesn't work right, is still
valid.
Re: [PATCHSET v6 0/4] Split iowait into two states
Posted by Christian Loehle 1 year, 5 months ago
On 9/5/24 12:00, Peter Zijlstra wrote:
> On Thu, Sep 05, 2024 at 11:31:09AM +0100, Christian Loehle wrote:
>> On 9/5/24 10:36, Peter Zijlstra wrote:
>>> On Wed, Sep 04, 2024 at 05:18:57PM +0200, Rafael J. Wysocki wrote:
>>>
>>>> To be more precise, there are two different uses of "iowait" in PM.
>>>>
>>>> One is the nr_iowait_cpu() call in menu_select() and the result of it
>>>> is used for two purposes: (1) select different sets of statistics
>>>> depending on whether or not this number is zero and (2) set a limit
>>>> for the idle state's exit latency that depends on this number (but
>>>> note that it only takes effect when the "iowait" statistics are used
>>>> in the first place).  Both of these are arguably questionable and it
>>>> is unclear to me whether or not they actually help and how much.
>>>
>>> So this one is very dubious, it relies on tasks getting back on the CPU
>>> they went to sleep on -- not guaranteed at all.
>>>
>>>> The other use is boosting CPU frequency in schedutil and intel_pstate
>>>> if SCHED_CPUFREQ_IOWAIT is passed to them which in turn depends on the
>>>> p->in_iowait value in enqueue_task_fair().
>>>
>>> This one is fine and makes sense. At this point we know that p is going
>>> to run and where it is going to run.
>>
>> On any even remotely realistic scenario and hardware though the boost
>> isn't effective until the next enqueue-dequeue-cycle, so if your above
>> objection is based on that, I would object here too, using your argument.
> 
> That is a quality of implementation issue with schedutil no?

Is it? There is a latency between requesting a new frequency and actually
running at it; for both x86 and arm platforms out there that should still
be at least a few usecs, during which the task is running. The task will
dequeue quite soon (otherwise it will build up utilization and then it's
not one we consider problematic wrt this io utilization problem anyway).
Just to be clear, I'm assuming fast_switch here, and then I think schedutil's
implementation isn't the problem; rather, the premise of the underlying
problem is.
I have tried to elaborate on that in the RFC I've posted and linked, though.
Re: [PATCHSET v6 0/4] Split iowait into two states
Posted by Christian Loehle 1 year, 5 months ago
On 9/4/24 16:18, Rafael J. Wysocki wrote:
> On Wed, Sep 4, 2024 at 4:42 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>
>> On Wed, Sep 4, 2024 at 4:28 PM Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>> On Mon, Aug 19, 2024 at 09:39:45AM -0600, Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> This is v6 of the patchset where the current in_iowait state is split
>>>> into two parts:
>>>>
>>>> 1) The "task is sleeping waiting on IO", and would like cpufreq goodness
>>>>    in terms of sleep and wakeup latencies.
>>>> 2) The above, and also accounted as such in the iowait stats.
>>>>
>>>> The current ->in_iowait covers both, this series splits it into two types
>>>> of state so that each can be controlled seperately.
>>>
>>> Yeah, but *WHY* !?!? I have some vague memories from last time around,
>>> but patches should really keep this information.
>>>
>>>> Patches 1..3 are prep patches, changing the type of
>>>> task_struct->nr_iowait and adding helpers to manipulate the iowait counts.
>>>>
>>>> Patch 4 does the actual splitting.
>>>>
>>>> This has been sitting for a while, would be nice to get this queued up
>>>> for 6.12. Comments welcome!
>>>
>>> Ufff, and all this because menu-governor does something insane :-(
>>>
>>> Rafael, why can't we simply remove this from menu?
>>
>> Same reason as before: people use it and refuse to stop.
>>
>> But this is mostly about the schedutil cpufreq governor that uses
>> iowait boosting.
> 
> To be more precise, there are two different uses of "iowait" in PM.
> 
> One is the nr_iowait_cpu() call in menu_select() and the result of it
> is used for two purposes: (1) select different sets of statistics
> depending on whether or not this number is zero and (2) set a limit
> for the idle state's exit latency that depends on this number (but
> note that it only takes effect when the "iowait" statistics are used
> in the first place).  Both of these are arguably questionable and it
> is unclear to me whether or not they actually help and how much.

So from my perspective it doesn't, not significantly enough to justify its
existence anyway. Either it doesn't actually matter for menu, or teo
is able to compete with / outperform it without relying on it.
Some caution is advised, though; this really depends on:
- Which idle states are available for the kernel to select.
- How accurate the kernel's view of the idle states is.

Both vary wildly.

> 
> The other use is boosting CPU frequency in schedutil and intel_pstate
> if SCHED_CPUFREQ_IOWAIT is passed to them which in turn depends on the
> p->in_iowait value in enqueue_task_fair().
> 
> AFAICS, the latter makes a major difference.


Indeed, fortunately the impact is quite limited here.
But please, Rafael, Jens and Peter, feel free to share your comments
over here too:

https://lore.kernel.org/lkml/20240905092645.2885200-1-christian.loehle@arm.com/
Re: [PATCHSET v6 0/4] Split iowait into two states
Posted by Rafael J. Wysocki 1 year, 5 months ago
On Thu, Sep 5, 2024 at 11:29 AM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 9/4/24 16:18, Rafael J. Wysocki wrote:
> > On Wed, Sep 4, 2024 at 4:42 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >>
> >> On Wed, Sep 4, 2024 at 4:28 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >>>
> >>> On Mon, Aug 19, 2024 at 09:39:45AM -0600, Jens Axboe wrote:
> >>>> Hi,
> >>>>
> >>>> This is v6 of the patchset where the current in_iowait state is split
> >>>> into two parts:
> >>>>
> >>>> 1) The "task is sleeping waiting on IO", and would like cpufreq goodness
> >>>>    in terms of sleep and wakeup latencies.
> >>>> 2) The above, and also accounted as such in the iowait stats.
> >>>>
> >>>> The current ->in_iowait covers both, this series splits it into two types
> >>>> of state so that each can be controlled seperately.
> >>>
> >>> Yeah, but *WHY* !?!? I have some vague memories from last time around,
> >>> but patches should really keep this information.
> >>>
> >>>> Patches 1..3 are prep patches, changing the type of
> >>>> task_struct->nr_iowait and adding helpers to manipulate the iowait counts.
> >>>>
> >>>> Patch 4 does the actual splitting.
> >>>>
> >>>> This has been sitting for a while, would be nice to get this queued up
> >>>> for 6.12. Comments welcome!
> >>>
> >>> Ufff, and all this because menu-governor does something insane :-(
> >>>
> >>> Rafael, why can't we simply remove this from menu?
> >>
> >> Same reason as before: people use it and refuse to stop.
> >>
> >> But this is mostly about the schedutil cpufreq governor that uses
> >> iowait boosting.
> >
> > To be more precise, there are two different uses of "iowait" in PM.
> >
> > One is the nr_iowait_cpu() call in menu_select() and the result of it
> > is used for two purposes: (1) select different sets of statistics
> > depending on whether or not this number is zero and (2) set a limit
> > for the idle state's exit latency that depends on this number (but
> > note that it only takes effect when the "iowait" statistics are used
> > in the first place).  Both of these are arguably questionable and it
> > is unclear to me whether or not they actually help and how much.
>
> So from my perspective it doesn't, not significantly to justify it's
> existence anyway. Either it doesn't actually matter for menu, or teo
> is able to compete / outperform without relying on it.

Thanks for this feedback!

I'm actually going to try to remove that stuff from menu and see if
anyone cries bloody murder.

> Some caution is advised though this really depends on:
> - Which idle states are available for the kernel to select.
> - How accurate the kernel's view of the idle states is.
>
> Both varies wildly.

True, but let's see what the feedback is.

> > The other use is boosting CPU frequency in schedutil and intel_pstate
> > if SCHED_CPUFREQ_IOWAIT is passed to them which in turn depends on the
> > p->in_iowait value in enqueue_task_fair().
> >
> > AFAICS, the latter makes a major difference.
>
>
> Indeed, fortunately the impact is quite limited here.
> But please, Rafael, Jens and Peter, feel free to share your comments
> over here too:
>
> https://lore.kernel.org/lkml/20240905092645.2885200-1-christian.loehle@arm.com/

I will.

Thanks!
Re: [PATCHSET v6 0/4] Split iowait into two states
Posted by Jens Axboe 1 year, 5 months ago
On 9/4/24 8:28 AM, Peter Zijlstra wrote:
> On Mon, Aug 19, 2024 at 09:39:45AM -0600, Jens Axboe wrote:
>> Hi,
>>
>> This is v6 of the patchset where the current in_iowait state is split
>> into two parts:
>>
>> 1) The "task is sleeping waiting on IO", and would like cpufreq goodness
>>    in terms of sleep and wakeup latencies.
>> 2) The above, and also accounted as such in the iowait stats.
>>
>> The current ->in_iowait covers both, this series splits it into two types
>> of state so that each can be controlled seperately.
> 
> Yeah, but *WHY* !?!? I have some vague memories from last time around,
> but patches should really keep this information.

To decouple the frequency boost on short waits from the accounting side,
as lots of tooling equates iowait time with busy time and reports it as
such. Yeah, that's garbage and a reporting issue, but decades of
education haven't really improved on that. We should've dumped iowait
once we moved away from 1-2 processor systems or got preemptible kernels,
but alas we did not, and here we are in 2024.

>> Patches 1..3 are prep patches, changing the type of
>> task_struct->nr_iowait and adding helpers to manipulate the iowait counts.
>>
>> Patch 4 does the actual splitting.
>>
>> This has been sitting for a while, would be nice to get this queued up
>> for 6.12. Comments welcome!
> 
> Ufff, and all this because menu-governor does something insane :-(
> 
> Rafael, why can't we simply remove this from menu? All the nr_iowait*()
> users are basically broken and I would much rather fix broken rather
> than work around broken like this.
> 
> That is, from where I'm sitting this all makes the io-wait situation far
> worse instead of better.

IMHO what we need is a way to propagate expected wait times for a
sleeper. Right now iowait serves this purpose in a very crude way, in
that it doesn't really tell you the expected wait, just that it's a
short one.
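
Purely to illustrate that idea (this is entirely hypothetical; no such
interface exists today and every name below is invented):

#include <stdio.h>

/* Hypothetical: the sleeper passes its expected wait instead of a single
 * "iowait" bit, and the lower layers map that to a boost / idle decision. */
enum sleep_hint { HINT_NONE, HINT_BOOST_AND_SHALLOW, HINT_BOOST_ONLY };

static enum sleep_hint hint_from_expected_wait(unsigned long expected_us)
{
	if (!expected_us)
		return HINT_NONE;		/* no idea, behave as today */
	if (expected_us < 100)
		return HINT_BOOST_AND_SHALLOW;	/* e.g. fast NVMe completion */
	if (expected_us < 10000)
		return HINT_BOOST_ONLY;
	return HINT_NONE;			/* long wait, let things idle */
}

int main(void)
{
	printf("50us wait -> hint %d\n", hint_from_expected_wait(50));
	printf("5ms wait  -> hint %d\n", hint_from_expected_wait(5000));
	return 0;
}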

If we simply remove iowait frequency boosting, then we'll have big
regressions, particularly for low queue depth / sync storage IO.

-- 
Jens Axboe
Re: [PATCHSET v6 0/4] Split iowait into two states
Posted by Peter Zijlstra 1 year, 5 months ago
On Wed, Sep 04, 2024 at 08:41:23AM -0600, Jens Axboe wrote:

> > Yeah, but *WHY* !?!? I have some vague memories from last time around,
> > but patches should really keep this information.
> 
> To decouple the frequency boost on short waits from the accounting side,
> as lots of tooling equates iowait time with busy time and reports it as
> such. Yeah that's garbage and a reporting issue, but decades of
> education hasn't really improved on that. We should've dumped iowait
> once we moved away from 1-2 processor system or had preemptible kernels,
> but alas we did not and here we are in 2024.

There's 'WAIT' in the name; what broken piece of garbage reports it as
busy time? That has *NEVER* been right. Even on UP systems where IO-wait
is actually a sensible number, it is explicitly the time it *could* have
been busy, if only the IO were faster.

And are we really going to make the whole kernel situation worse just
because there's a bunch of broken userspace?

> >> Patches 1..3 are prep patches, changing the type of
> >> task_struct->nr_iowait and adding helpers to manipulate the iowait counts.
> >>
> >> Patch 4 does the actual splitting.
> >>
> >> This has been sitting for a while, would be nice to get this queued up
> >> for 6.12. Comments welcome!
> > 
> > Ufff, and all this because menu-governor does something insane :-(
> > 
> > Rafael, why can't we simply remove this from menu? All the nr_iowait*()
> > users are basically broken and I would much rather fix broken rather
> > than work around broken like this.
> > 
> > That is, from where I'm sitting this all makes the io-wait situation far
> > worse instead of better.
> 
> IMHO what we need is a way to propagate expected wait times for a
> sleeper. Right now iowait serves this purpose in a very crude way, in
> that it doesn't really tell you the expected wait, just that it's a
> short one.

Expected wait time is one thing, but you then *still* have no clue what
CPU it will get back on. Very typically it will be another CPU in the
same cache cluster. One that had no consideration of it when it went to
sleep.

A sleeping task is not associated with a CPU. There is a fundamental
mismatch there.

Using io-wait for idle state selection is very tricky because of this.

> If we simply remove iowait frequency boosting, then we'll have big
> regressions particularly for low/sync storage IO.

The frequency boosting thing I don't object to. That happens on wakeup,
after we know that and where a task is going to run.
Re: [PATCHSET v6 0/4] Split iowait into two states
Posted by Jens Axboe 1 year, 5 months ago
On 9/4/24 8:41 AM, Jens Axboe wrote:
> On 9/4/24 8:28 AM, Peter Zijlstra wrote:
>> On Mon, Aug 19, 2024 at 09:39:45AM -0600, Jens Axboe wrote:
>>> Hi,
>>>
>>> This is v6 of the patchset where the current in_iowait state is split
>>> into two parts:
>>>
>>> 1) The "task is sleeping waiting on IO", and would like cpufreq goodness
>>>    in terms of sleep and wakeup latencies.
>>> 2) The above, and also accounted as such in the iowait stats.
>>>
>>> The current ->in_iowait covers both, this series splits it into two types
>>> of state so that each can be controlled seperately.
>>
>> Yeah, but *WHY* !?!? I have some vague memories from last time around,
>> but patches should really keep this information.
> 
> To decouple the frequency boost on short waits from the accounting side,
> as lots of tooling equates iowait time with busy time and reports it as
> such. Yeah that's garbage and a reporting issue, but decades of
> education hasn't really improved on that. We should've dumped iowait
> once we moved away from 1-2 processor system or had preemptible kernels,
> but alas we did not and here we are in 2024.

Forgot to mention, it's not *just* an educational thing - lots of services
do mixed network and disk IO, obviously, and they do have
some interest in retaining iowait metrics on the disk side.

-- 
Jens Axboe
Re: [PATCHSET v6 0/4] Split iowait into two states
Posted by Christian Loehle 1 year, 5 months ago
On 8/19/24 16:39, Jens Axboe wrote:
> Hi,
> 
> This is v6 of the patchset where the current in_iowait state is split
> into two parts:
> 
> 1) The "task is sleeping waiting on IO", and would like cpufreq goodness
>    in terms of sleep and wakeup latencies.
> 2) The above, and also accounted as such in the iowait stats.
> 
> The current ->in_iowait covers both, this series splits it into two types
> of state so that each can be controlled seperately.

Hi Jens,
I wanted to give a brief update on where I think we're at in terms
of iowait behavior regarding cpuidle and cpufreq.
I'm still working on getting both removed; given the discussions on
the list [0] and at OSPM [1], this seems realistic and the best way
forward IMO.
That would then naturally make this series and the iowait workaround in
io_uring/io_uring.c unnecessary.

1. For cpuidle:
The main issue with relying on nr_iowaiters is that there is no guarantee
whatsoever that these tasks will wake up where they went to sleep, so if
we can achieve the same throughput without nr_iowaiters it shouldn't
be relevant.
I spent quite some time fixing teo [2], because untangling nr_iowaiters
from menu seems hard; essentially nobody has worked on menu seriously for
a while now. Thus the plan here is to replace menu with teo eventually.
For your io_uring workloads I see throughput on par between teo (which
doesn't rely on iowait) and menu.

# echo teo > /sys/devices/system/cpu/cpuidle/current_governor
#  ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1 
submitter=0, tid=206, file=/dev/nvme0n1, node=-1
polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=1
Engine=preadv2
IOPS=22500, BW=87MiB/s, IOS/call=0/0
IOPS=21916, BW=85MiB/s, IOS/call=1/0
IOPS=21774, BW=85MiB/s, IOS/call=1/0
IOPS=22467, BW=87MiB/s, IOS/call=1/0
Exiting on timeout
Maximum IOPS=22500
# echo menu > /sys/devices/system/cpu/cpuidle/current_governor
[  178.754571] cpuidle: using governor menu
#  ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1 
submitter=0, tid=209, file=/dev/nvme0n1, node=-1
polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=1
Engine=preadv2
IOPS=21452, BW=83MiB/s, IOS/call=0/0
IOPS=21778, BW=85MiB/s, IOS/call=1/0
IOPS=21120, BW=82MiB/s, IOS/call=1/0
IOPS=20903, BW=81MiB/s, IOS/call=1/0
Exiting on timeout
Maximum IOPS=21778

Please do give it a try for yourself as well!

2. For cpufreq:
The main issue with iowait boosting for IO-bound workloads is that we're
punishing the 'good' workloads (those that don't have iowait sleeps in their
throughput-critical part, which is already bad because of the scheduling
overhead induced) by making them energy-inefficient just to make synthetic
benchmarks happy.
A study of more realistic workloads shows that they don't suffer from a
problem building up utilization (not util_est anyway), so they don't
actually benefit from a cpufreq boost.
This leads me to the conclusion that cpufreq iowait boosting can be scrapped
altogether if we accept some degradation in benchmarks like
./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1
or
fio --name=fio --rw=randread --bs=4k --runtime=5 --time_based --filename=/dev/nvme0n1 --iodepth=1 --numjobs=1
(non-io_uring) for that matter.

For io_uring, where the expected case is probably not single-threaded sync IO
(or iodepth=1), the cpufreq iowait boost just hurts use cases by pushing
them to less efficient frequencies that might not be needed.
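
To make the "less efficient frequencies" point concrete, here is a
simplified standalone model of how schedutil escalates the iowait boost on
consecutive iowait wakeups and decays it otherwise; the doubling/halving
shape follows cpufreq_schedutil.c, but the constants and decay timing are
simplified:

#include <stdio.h>

#define CAP_SCALE	1024u		/* SCHED_CAPACITY_SCALE */
#define BOOST_MIN	(CAP_SCALE / 8)

/* One governor update: an iowait wakeup doubles the boost (capped at max),
 * a quiet update halves it until it falls away. */
static unsigned int iowait_boost_step(unsigned int boost, int iowait_wakeup)
{
	if (iowait_wakeup) {
		boost = boost ? boost * 2 : BOOST_MIN;
		if (boost > CAP_SCALE)
			boost = CAP_SCALE;
	} else {
		boost /= 2;
		if (boost < BOOST_MIN)
			boost = 0;
	}
	return boost;
}

int main(void)
{
	unsigned int boost = 0;

	/* A tight submit/complete loop ends up pinned at the maximum boost,
	 * i.e. the governor keeps asking for top frequency. */
	for (int i = 1; i <= 6; i++) {
		boost = iowait_boost_step(boost, 1);
		printf("iowait wakeup %d: boost=%4u/%u\n", i, boost, CAP_SCALE);
	}
	return 0;
}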


I know you want your problem (io_uring showing up as 100% busy even though it's
just sleeping) to be solved like yesterday and my opinion on a future timeline
might not be enough to convince you of much. I wanted to share it anyway.
I don't see an issue with the actual code you're proposing, but it does feel
like a step in the wrong direction to me.

[0] https://lore.kernel.org/lkml/20240304201625.100619-1-christian.loehle@arm.com/
v2: https://lore.kernel.org/lkml/20240518113947.2127802-1-christian.loehle@arm.com/
[1] https://www.youtube.com/watch?v=MSQGEsSziZ4
[2] https://lore.kernel.org/lkml/20240628095955.34096-1-christian.loehle@arm.com/

Regards,
Christian

> [snip]
Re: [PATCHSET v6 0/4] Split iowait into two states
Posted by Jens Axboe 1 year, 5 months ago
On 8/21/24 8:54 AM, Christian Loehle wrote:
> On 8/19/24 16:39, Jens Axboe wrote:
>> Hi,
>>
>> This is v6 of the patchset where the current in_iowait state is split
>> into two parts:
>>
>> 1) The "task is sleeping waiting on IO", and would like cpufreq goodness
>>    in terms of sleep and wakeup latencies.
>> 2) The above, and also accounted as such in the iowait stats.
>>
>> The current ->in_iowait covers both, this series splits it into two types
>> of state so that each can be controlled seperately.
> 
> Hi Jens,
> I wanted to give a brief update on where I think we're at in terms
> of iowait behavior regarding cpuidle and cpufreq.
> I'm still working on getting both removed, given the discussions had
> on the list [0] and at OSPM [1] this seems realistic and the best way
> forward IMO.
> That would then naturally make this series and the iowait workaround in
> io_uring/io_uring.c unnecessary.
> 
> 1. For cpuidle:
> Main issue with relying on nr_iowaiters is that there is no guarantee
> whatsoever that these tasks will wakeup where they went to sleep so if
> we can achieve the same throughput without nr_iowaiters it shouldn't
> be relevant.
> I spent quite some time in fixing teo [2], because untangling nr_iowaiters
> from menu seems hard, essentially nobody has worked on menu seriously for
> a while now. Thus the plan here is to replace menu by teo eventually.
> For your io_uring workloads I see throughput on par for teo (doesn't rely
> on iowait) and menu.
> 
> # echo teo > /sys/devices/system/cpu/cpuidle/current_governor
> #  ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1 
> submitter=0, tid=206, file=/dev/nvme0n1, node=-1
> polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=1
> Engine=preadv2
> IOPS=22500, BW=87MiB/s, IOS/call=0/0
> IOPS=21916, BW=85MiB/s, IOS/call=1/0
> IOPS=21774, BW=85MiB/s, IOS/call=1/0
> IOPS=22467, BW=87MiB/s, IOS/call=1/0
> Exiting on timeout
> Maximum IOPS=22500
> # echo menu > /sys/devices/system/cpu/cpuidle/current_governor
> [  178.754571] cpuidle: using governor menu
> #  ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1 
> submitter=0, tid=209, file=/dev/nvme0n1, node=-1
> polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=1
> Engine=preadv2
> IOPS=21452, BW=83MiB/s, IOS/call=0/0
> IOPS=21778, BW=85MiB/s, IOS/call=1/0
> IOPS=21120, BW=82MiB/s, IOS/call=1/0
> IOPS=20903, BW=81MiB/s, IOS/call=1/0
> Exiting on timeout
> Maximum IOPS=21778
> 
> Please do give it a try for yourself as well!
> 
> 2. For cpufreq:
> Main issue for IO-bound workloads with iowait boosting is we're punishing
> the 'good' workloads (that don't have iowait sleeps in their throughput-critical
> part, which is already bad because of the scheduling overhead induced) by
> making them energy-inefficient to make synthetic benchmarks happy.
> A study of more realistic workloads show that they don't suffer from a problem
> of building up utilization, not util_est anyway, so they don't actually benefit
> from a cpufreq boost.
> This leads me to the conclusion that cpufreq iowait boosting can be scrapped
> altogether if we accept some degradation of benchmarks like
> ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1
> or
> fio --name=fio --rw=randread --bs=4k --runtime=5 --time_based --filename=/dev/nvme0n1 --iodepth=1 --numjobs=1
> (non-io_uring) for that matter.

The original iowait addition came about because a big regression was seen
compared to not setting iowait; it was around 20% iirc. That's big, and
not in the realm of "some degradation" that would be acceptable. And that
will largely depend on the system being used. On some systems it'll be
less, and on some it'll be more.

> For io_uring where the expected case is probably not single-threaded
> sync IO (or iodepth=1) the cpufreq iowait boost is just hurting
> use-cases by pushing it to less efficient frequencies that might not
> be needed.

People do all sorts of things, and sync (or low queue depth) IO is
certainly one of the use cases. In fact that's where the above report
came from, on the postgres aio side.

> I know you want your problem (io_uring showing up as 100% busy even
> though it's just sleeping) to be solved like yesterday and my opinion
> on a future timeline might not be enough to convince you of much. I
> wanted to share it anyway. I don't see an issue with the actual code
> you're proposing, but it does feel like a step in the wrong direction
> to me.

As mentioned in my original reply, I view this as entirely orthogonal,
and while I appreciate your efforts in this area, I'm a little tired of
this being brought up as a gatekeeping metric when it's not there.

If we can eliminate iowait for boosting down the line, then I'm all for
it. But this has now been pending for > 6 months and I don't think it's
fair to keep stringing this along on a future promise. This isn't a lot
of code and it solves the issue for now; if the code gets removed
down the line as not needed, then that's certainly fine. For now, we
need it.

-- 
Jens Axboe
Re: [PATCHSET v6 0/4] Split iowait into two states
Posted by Christian Loehle 1 year, 5 months ago
On 8/21/24 16:04, Jens Axboe wrote:
> On 8/21/24 8:54 AM, Christian Loehle wrote:
>> On 8/19/24 16:39, Jens Axboe wrote:
>>> Hi,
>>>
>>> This is v6 of the patchset where the current in_iowait state is split
>>> into two parts:
>>>
>>> 1) The "task is sleeping waiting on IO", and would like cpufreq goodness
>>>    in terms of sleep and wakeup latencies.
>>> 2) The above, and also accounted as such in the iowait stats.
>>>
>>> The current ->in_iowait covers both, this series splits it into two types
>>> of state so that each can be controlled seperately.
>>
>> Hi Jens,
>> I wanted to give a brief update on where I think we're at in terms
>> of iowait behavior regarding cpuidle and cpufreq.
>> I'm still working on getting both removed, given the discussions had
>> on the list [0] and at OSPM [1] this seems realistic and the best way
>> forward IMO.
>> That would then naturally make this series and the iowait workaround in
>> io_uring/io_uring.c unnecessary.
>>
>> 1. For cpuidle:
>> Main issue with relying on nr_iowaiters is that there is no guarantee
>> whatsoever that these tasks will wakeup where they went to sleep so if
>> we can achieve the same throughput without nr_iowaiters it shouldn't
>> be relevant.
>> I spent quite some time in fixing teo [2], because untangling nr_iowaiters
>> from menu seems hard, essentially nobody has worked on menu seriously for
>> a while now. Thus the plan here is to replace menu by teo eventually.
>> For your io_uring workloads I see throughput on par for teo (doesn't rely
>> on iowait) and menu.
>>
>> # echo teo > /sys/devices/system/cpu/cpuidle/current_governor
>> #  ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1 
>> submitter=0, tid=206, file=/dev/nvme0n1, node=-1
>> polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=1
>> Engine=preadv2
>> IOPS=22500, BW=87MiB/s, IOS/call=0/0
>> IOPS=21916, BW=85MiB/s, IOS/call=1/0
>> IOPS=21774, BW=85MiB/s, IOS/call=1/0
>> IOPS=22467, BW=87MiB/s, IOS/call=1/0
>> Exiting on timeout
>> Maximum IOPS=22500
>> # echo menu > /sys/devices/system/cpu/cpuidle/current_governor
>> [  178.754571] cpuidle: using governor menu
>> #  ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1 
>> submitter=0, tid=209, file=/dev/nvme0n1, node=-1
>> polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=1
>> Engine=preadv2
>> IOPS=21452, BW=83MiB/s, IOS/call=0/0
>> IOPS=21778, BW=85MiB/s, IOS/call=1/0
>> IOPS=21120, BW=82MiB/s, IOS/call=1/0
>> IOPS=20903, BW=81MiB/s, IOS/call=1/0
>> Exiting on timeout
>> Maximum IOPS=21778
>>
>> Please do give it a try for yourself as well!
>>
>> 2. For cpufreq:
>> Main issue for IO-bound workloads with iowait boosting is we're punishing
>> the 'good' workloads (that don't have iowait sleeps in their throughput-critical
>> part, which is already bad because of the scheduling overhead induced) by
>> making them energy-inefficient to make synthetic benchmarks happy.
>> A study of more realistic workloads show that they don't suffer from a problem
>> of building up utilization, not util_est anyway, so they don't actually benefit
>> from a cpufreq boost.
>> This leads me to the conclusion that cpufreq iowait boosting can be scrapped
>> altogether if we accept some degradation of benchmarks like
>> ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1
>> or
>> fio --name=fio --rw=randread --bs=4k --runtime=5 --time_based --filename=/dev/nvme0n1 --iodepth=1 --numjobs=1
>> (non-io_uring) for that matter.
> 
> The original iowait addition came because a big regression was seen
> compared to not setting iowait, it was around 20% iirc. That's big, and
> not in the realm of "some degradation" that will be acceptable. And that
> will largely depend on the system being used. On some systems, it'll be
> less, and on some it'll be more.

We are also talking about power regressions of 1000% easily FWIW for e.g.
fio --name=fio --rw=randread --bs=4k --runtime=10 --time_based --filename=/dev/nvme0n1 --iodepth=32 --numjobs=nr_cpus --ioengine=io_uring
(without any throughput gain).

> 
>> For io_uring where the expected case is probably not single-threaded
>> sync IO (or iodepth=1) the cpufreq iowait boost is just hurting
>> use-cases by pushing it to less efficient frequencies that might not
>> be needed.
> 
> People do all sorts of things, and sync (or low queue depth) IO is
> certainly one of the use cases. In fact that's where the above report
> came from, on the postgres aio side.

I have looked at that and (on the platforms I've tested) that was indeed
from cpuidle FWIW. Moving away from menu did remedy this with the
mainlined teo fixes.

>> I know you want your problem (io_uring showing up as 100% busy even
>> though it's just sleeping) to be solved like yesterday and my opinion
>> on a future timeline might not be enough to convince you of much. I
>> wanted to share it anyway. I don't see an issue with the actual code
>> you're proposing, but it does feel like a step in the wrong direction
>> to me.
> 
> As mentioned in my original reply, I view this as entirely orthogonal,
> and while I appreciate your efforts in this area, I'm a little tired of
> this being brought up as a gatekeeping metric when it's not there.

I can understand you being tired of me bringing this up, but I'm not
gatekeeping this series, not intentionally anyway.
Just trying to give some perspective on the entire iowait behavior
future.

> 
> If we can eliminate iowait for boosting down the line, then I'm all for
> it. But this has now been pending for > 6 months and I don't think it's
> far to keep stringing this along on a future promise. This isn't a lot
> of code and it solves the issue for now, if the code will get removed
> down the line as not needed, then that's certainly fine. For now, we
> need it.

I'm fine with carrying a revert of the series along my patchset.

Regards,
Christian
Re: [PATCHSET v6 0/4] Split iowait into two states
Posted by Jens Axboe 1 year, 5 months ago
On 8/21/24 9:57 AM, Christian Loehle wrote:
> On 8/21/24 16:04, Jens Axboe wrote:
>> On 8/21/24 8:54 AM, Christian Loehle wrote:
>>> On 8/19/24 16:39, Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> This is v6 of the patchset where the current in_iowait state is split
>>>> into two parts:
>>>>
>>>> 1) The "task is sleeping waiting on IO", and would like cpufreq goodness
>>>>    in terms of sleep and wakeup latencies.
>>>> 2) The above, and also accounted as such in the iowait stats.
>>>>
>>>> The current ->in_iowait covers both, this series splits it into two types
>>>> of state so that each can be controlled seperately.
>>>
>>> Hi Jens,
>>> I wanted to give a brief update on where I think we're at in terms
>>> of iowait behavior regarding cpuidle and cpufreq.
>>> I'm still working on getting both removed, given the discussions had
>>> on the list [0] and at OSPM [1] this seems realistic and the best way
>>> forward IMO.
>>> That would then naturally make this series and the iowait workaround in
>>> io_uring/io_uring.c unnecessary.
>>>
>>> 1. For cpuidle:
>>> Main issue with relying on nr_iowaiters is that there is no guarantee
>>> whatsoever that these tasks will wakeup where they went to sleep so if
>>> we can achieve the same throughput without nr_iowaiters it shouldn't
>>> be relevant.
>>> I spent quite some time in fixing teo [2], because untangling nr_iowaiters
>>> from menu seems hard, essentially nobody has worked on menu seriously for
>>> a while now. Thus the plan here is to replace menu by teo eventually.
>>> For your io_uring workloads I see throughput on par for teo (doesn't rely
>>> on iowait) and menu.
>>>
>>> # echo teo > /sys/devices/system/cpu/cpuidle/current_governor
>>> #  ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1 
>>> submitter=0, tid=206, file=/dev/nvme0n1, node=-1
>>> polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=1
>>> Engine=preadv2
>>> IOPS=22500, BW=87MiB/s, IOS/call=0/0
>>> IOPS=21916, BW=85MiB/s, IOS/call=1/0
>>> IOPS=21774, BW=85MiB/s, IOS/call=1/0
>>> IOPS=22467, BW=87MiB/s, IOS/call=1/0
>>> Exiting on timeout
>>> Maximum IOPS=22500
>>> # echo menu > /sys/devices/system/cpu/cpuidle/current_governor
>>> [  178.754571] cpuidle: using governor menu
>>> #  ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1 
>>> submitter=0, tid=209, file=/dev/nvme0n1, node=-1
>>> polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=1
>>> Engine=preadv2
>>> IOPS=21452, BW=83MiB/s, IOS/call=0/0
>>> IOPS=21778, BW=85MiB/s, IOS/call=1/0
>>> IOPS=21120, BW=82MiB/s, IOS/call=1/0
>>> IOPS=20903, BW=81MiB/s, IOS/call=1/0
>>> Exiting on timeout
>>> Maximum IOPS=21778
>>>
>>> Please do give it a try for yourself as well!
>>>
>>> 2. For cpufreq:
>>> Main issue for IO-bound workloads with iowait boosting is we're punishing
>>> the 'good' workloads (that don't have iowait sleeps in their throughput-critical
>>> part, which is already bad because of the scheduling overhead induced) by
>>> making them energy-inefficient to make synthetic benchmarks happy.
>>> A study of more realistic workloads show that they don't suffer from a problem
>>> of building up utilization, not util_est anyway, so they don't actually benefit
>>> from a cpufreq boost.
>>> This leads me to the conclusion that cpufreq iowait boosting can be scrapped
>>> altogether if we accept some degradation of benchmarks like
>>> ./io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S 1 -R 0 /dev/nvme0n1
>>> or
>>> fio --name=fio --rw=randread --bs=4k --runtime=5 --time_based --filename=/dev/nvme0n1 --iodepth=1 --numjobs=1
>>> (non-io_uring) for that matter.
>>
>> The original iowait addition came because a big regression was seen
>> compared to not setting iowait, it was around 20% iirc. That's big, and
>> not in the realm of "some degradation" that will be acceptable. And that
>> will largely depend on the system being used. On some systems, it'll be
>> less, and on some it'll be more.
> 
> We are also talking about power regressions of 1000% easily FWIW for
> e.g. fio --name=fio --rw=randread --bs=4k --runtime=10 --time_based
> --filename=/dev/nvme0n1 --iodepth=32 --numjobs=nr_cpus
> --ioengine=io_uring (without any throughput gain).

Oh I believe it, for some embedded or low-power CPUs. And it is on our
list to make this selectable. Ideally what I think should happen is
that the application gives you a hint on how long it expects to sleep,
and we'll pass that on and let the lower layers decide what's the most
appropriate state to enter. The current iowait usage isn't very pretty
(in io_uring or otherwise, it's too coarse of a hint), but it's what we
have/had, and we needed it to solve a problem that would otherwise be a
regression on a much more common setup than really low-power devices.

>>> For io_uring where the expected case is probably not single-threaded
>>> sync IO (or iodepth=1) the cpufreq iowait boost is just hurting
>>> use-cases by pushing it to less efficient frequencies that might not
>>> be needed.
>>
>> People do all sorts of things, and sync (or low queue depth) IO is
>> certainly one of the use cases. In fact that's where the above report
>> came from, on the postgres aio side.
> 
> I have looked at that and (on the platforms I've tested) that was indeed
> from cpuidle FWIW. Moving away from menu did remedy this with the
> mainlined teo fixes.
> 
>>> I know you want your problem (io_uring showing up as 100% busy even
>>> though it's just sleeping) to be solved like yesterday and my opinion
>>> on a future timeline might not be enough to convince you of much. I
>>> wanted to share it anyway. I don't see an issue with the actual code
>>> you're proposing, but it does feel like a step in the wrong direction
>>> to me.
>>
>> As mentioned in my original reply, I view this as entirely orthogonal,
>> and while I appreciate your efforts in this area, I'm a little tired of
>> this being brought up as a gatekeeping metric when it's not there.
> 
> I can understand you being tired of me bringing this up, but I'm not
> gatekeeping this series, not intentionally anyway.

Well it does feel like that, because this orthogonal (imho) development
is being brought up as a reason not to need to do this. Not just in this
posting, but in past ones too. Meanwhile, I'd like this problem solved, and
this just adds noise to it as far as I'm concerned. It would be a lot
better to split those two discussions up.

>> If we can eliminate iowait for boosting down the line, then I'm all for
>> it. But this has now been pending for > 6 months and I don't think it's
>> far to keep stringing this along on a future promise. This isn't a lot
>> of code and it solves the issue for now, if the code will get removed
>> down the line as not needed, then that's certainly fine. For now, we
>> need it.
> 
> I'm fine with carrying a revert of the series along my patchset.

OK that's fine, and let's hope we end up in a place down the line that's
a lot better than the iowait on/off we have now, with guesswork based on
past behavior (iow, mostly wrong) on the other end as to how long we
expect to sleep. I'd certainly be all for that, I just don't want future
promises to stop us from fixing a real issue we have now. If this series
goes away down the line because we don't need it, I surely won't cry over it!

-- 
Jens Axboe