[v2] sched: Don't try to catch up excess steal time.

[PATCH v2] sched: Don't try to catch up excess steal time.

Posted by Suleiman Souhlal 1 year, 4 months ago

When steal time exceeds the measured delta when updating clock_task, we
currently try to catch up the excess in future updates.
However, this results in inaccurate run times for the future things using
clock_task, as they end up getting additional steal time that did not
actually happen.

For example, suppose a task in a VM runs for 10ms and had 15ms of steal
time reported while it ran. clock_task rightly doesn't advance. Then, a
different taks runs on the same rq for 10ms without any time stolen in
the host.
Because of the current catch up mechanism, clock_sched inaccurately ends
up advancing by only 5ms instead of 10ms even though there wasn't any
actual time stolen. The second task is getting charged for less time
than it ran, even though it didn't deserve it.
This can result in tasks getting more run time than they should actually
get.

So, we instead don't make future updates pay back past excess stolen time.

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
v2:
- Slightly changed to simply moving one line up instead of adding
  new variable.

v1: https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f3951e4a55e5..6c34de8b3fbb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -730,11 +730,11 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 	if (static_key_false((&paravirt_steal_rq_enabled))) {
 		steal = paravirt_steal_clock(cpu_of(rq));
 		steal -= rq->prev_steal_time_rq;
+		rq->prev_steal_time_rq += steal;
 
 		if (unlikely(steal > delta))
 			steal = delta;
 
-		rq->prev_steal_time_rq += steal;
 		delta -= steal;
 	}
 #endif
-- 
2.46.0.598.g6f2099f65c-goog

Re: [PATCH v2] sched: Don't try to catch up excess steal time.

Posted by David Woodhouse 1 year, 4 months ago

On Wed, 2024-09-11 at 20:15 +0900, Suleiman Souhlal wrote:
> When steal time exceeds the measured delta when updating clock_task,
> we
> currently try to catch up the excess in future updates.
> However, this results in inaccurate run times for the future things
> using
> clock_task, as they end up getting additional steal time that did not
> actually happen.
> 
> For example, suppose a task in a VM runs for 10ms and had 15ms of
> steal
> time reported while it ran. clock_task rightly doesn't advance. Then,
> a
> different taks runs on the same rq for 10ms without any time stolen
> in
> the host.
> Because of the current catch up mechanism, clock_sched inaccurately
> ends
> up advancing by only 5ms instead of 10ms even though there wasn't any
> actual time stolen. The second task is getting charged for less time
> than it ran, even though it didn't deserve it.
> This can result in tasks getting more run time than they should
> actually
> get.
> 
> So, we instead don't make future updates pay back past excess stolen
> time.
> 
> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> ---
> v2:
> - Slightly changed to simply moving one line up instead of adding
>   new variable.
> 
> v1:
> https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com
> ---
>  kernel/sched/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f3951e4a55e5..6c34de8b3fbb 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -730,11 +730,11 @@ static void update_rq_clock_task(struct rq *rq,
> s64 delta)
>         if (static_key_false((&paravirt_steal_rq_enabled))) {
>                 steal = paravirt_steal_clock(cpu_of(rq));
>                 steal -= rq->prev_steal_time_rq;
> +               rq->prev_steal_time_rq += steal;

The above two lines are essentially:

	steal -= prev;
	prev += steal;

It's like one of those clever ways of exchanging two variables with
three XOR operations. I don't like it :)

Ultimately, you're just setting rq->prev_steal_time_rq to the latest
value that you just read from paravirt_steal_clock(). And then setting
'steal' to the delta between the new reading and the previous one.

The code above is *far* from obvious. At the very least it wants a
comment, but I'd rather see it refactored so that it doesn't need one. 

    u64 abs_steal_now = paravirt_steal_clock(cpu_of(rq));
    steal = abs_steal_now - rq->prev_steal_time_rq;
    rq->prev_steal_time_rq = abs_steal_now;

I'm still not utterly convinced this is the right thing to do, though.
It means you will constantly undermeasure the accounting of steal time
as you discard the excess each time.

The underlying bug here is that we are sampling the steal time and the
time slice at *different* times. This update_rq_clock_task() function
could be called with a calculated 'delta' argument... and then
experience a large amount of steal time *before* calling
paravirt_steal_clock(), which is how we end up in the situation where
the calculated steal time exceeds the running time of the previous
task.

Which task *should* that steal time be accounted to? At the moment it
ends up being accounted to the next task to run — which seems to make
sense to me. In the situation I just described, we can consider the
time stolen in update_rq_clock_task() itself to have been stolen from
the *incoming* task, not the *outgoing* one. But that seems to be what
you're objecting to?

In
https://lore.kernel.org/all/20240522001817.619072-22-dwmw2@infradead.org/
I put a limit on the amount of steal time carried forward from one
timeslice to the next, as it was misbehaving when a bad hypervisor
reported negative steal time. But I don't think the limit should be
zero.

Of course, *ideally* we'd be able to sample the time and steal time
*simultaneously*, with a single sched_clock_cpu_and_steal() function so
that we don't have to deal with this slop between readings. Then none
of this would be necessary. But that seems hard.

Re: [PATCH v2] sched: Don't try to catch up excess steal time.

Posted by Suleiman Souhlal 1 year, 4 months ago

On Wed, Sep 25, 2024 at 12:45:55PM +0100, David Woodhouse wrote:
> On Wed, 2024-09-11 at 20:15 +0900, Suleiman Souhlal wrote:
> > When steal time exceeds the measured delta when updating clock_task,
> > we
> > currently try to catch up the excess in future updates.
> > However, this results in inaccurate run times for the future things
> > using
> > clock_task, as they end up getting additional steal time that did not
> > actually happen.
> > 
> > For example, suppose a task in a VM runs for 10ms and had 15ms of
> > steal
> > time reported while it ran. clock_task rightly doesn't advance. Then,
> > a
> > different taks runs on the same rq for 10ms without any time stolen
> > in
> > the host.
> > Because of the current catch up mechanism, clock_sched inaccurately
> > ends
> > up advancing by only 5ms instead of 10ms even though there wasn't any
> > actual time stolen. The second task is getting charged for less time
> > than it ran, even though it didn't deserve it.
> > This can result in tasks getting more run time than they should
> > actually
> > get.
> > 
> > So, we instead don't make future updates pay back past excess stolen
> > time.
> > 
> > Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> > ---
> > v2:
> > - Slightly changed to simply moving one line up instead of adding
> >   new variable.
> > 
> > v1:
> > https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com
> > ---
> >  kernel/sched/core.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f3951e4a55e5..6c34de8b3fbb 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -730,11 +730,11 @@ static void update_rq_clock_task(struct rq *rq,
> > s64 delta)
> >         if (static_key_false((&paravirt_steal_rq_enabled))) {
> >                 steal = paravirt_steal_clock(cpu_of(rq));
> >                 steal -= rq->prev_steal_time_rq;
> > +               rq->prev_steal_time_rq += steal;
> 
> The above two lines are essentially:
> 
> 	steal -= prev;
> 	prev += steal;
> 
> It's like one of those clever ways of exchanging two variables with
> three XOR operations. I don't like it :)
> 
> Ultimately, you're just setting rq->prev_steal_time_rq to the latest
> value that you just read from paravirt_steal_clock(). And then setting
> 'steal' to the delta between the new reading and the previous one.
> 
> The code above is *far* from obvious. At the very least it wants a
> comment, but I'd rather see it refactored so that it doesn't need one. 
> 
>     u64 abs_steal_now = paravirt_steal_clock(cpu_of(rq));
>     steal = abs_steal_now - rq->prev_steal_time_rq;
>     rq->prev_steal_time_rq = abs_steal_now;

That is what v1 did:
https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com/

It is also more obvious to me, but the feedback I received was that
the way in the current iteration is better.

I don't feel strongly about it, and I'd be ok with either version applied. 

> 
> I'm still not utterly convinced this is the right thing to do, though.
> It means you will constantly undermeasure the accounting of steal time
> as you discard the excess each time.
> 
> The underlying bug here is that we are sampling the steal time and the
> time slice at *different* times. This update_rq_clock_task() function
> could be called with a calculated 'delta' argument... and then
> experience a large amount of steal time *before* calling
> paravirt_steal_clock(), which is how we end up in the situation where
> the calculated steal time exceeds the running time of the previous
> task.
> 
> Which task *should* that steal time be accounted to? At the moment it
> ends up being accounted to the next task to run — which seems to make
> sense to me. In the situation I just described, we can consider the
> time stolen in update_rq_clock_task() itself to have been stolen from
> the *incoming* task, not the *outgoing* one. But that seems to be what
> you're objecting to?

This is a good description of the problem, except that the time stolen
in update_rq_clock_task() itself is actually being stolen from the 
outgoing task. This is because we are still trying to calculate how long
it ran for (update_curr()), and time hasn't started ticking for the
incoming task yet. We haven't set the incoming task's exec_start with the
new clock_task time yet.

So, in my opinion, it's wrong to give that time to the incoming task.

> 
> In
> https://lore.kernel.org/all/20240522001817.619072-22-dwmw2@infradead.org/
> I put a limit on the amount of steal time carried forward from one
> timeslice to the next, as it was misbehaving when a bad hypervisor
> reported negative steal time. But I don't think the limit should be
> zero.
> 
> Of course, *ideally* we'd be able to sample the time and steal time
> *simultaneously*, with a single sched_clock_cpu_and_steal() function so
> that we don't have to deal with this slop between readings. Then none
> of this would be necessary. But that seems hard.

I agree that that would be ideal.

Thanks,
-- Suleiman

Re: [PATCH v2] sched: Don't try to catch up excess steal time.

Posted by David Woodhouse 1 year, 4 months ago

On Wed, 2024-09-25 at 13:25 +0000, Suleiman Souhlal wrote:
> On Wed, Sep 25, 2024 at 12:45:55PM +0100, David Woodhouse wrote:
> > On Wed, 2024-09-11 at 20:15 +0900, Suleiman Souhlal wrote:
> > > When steal time exceeds the measured delta when updating clock_task,
> > > we
> > > currently try to catch up the excess in future updates.
> > > However, this results in inaccurate run times for the future things
> > > using
> > > clock_task, as they end up getting additional steal time that did not
> > > actually happen.
> > > 
> > > For example, suppose a task in a VM runs for 10ms and had 15ms of
> > > steal
> > > time reported while it ran. clock_task rightly doesn't advance. Then,
> > > a
> > > different taks runs on the same rq for 10ms without any time stolen
> > > in
> > > the host.
> > > Because of the current catch up mechanism, clock_sched inaccurately
> > > ends
> > > up advancing by only 5ms instead of 10ms even though there wasn't any
> > > actual time stolen. The second task is getting charged for less time
> > > than it ran, even though it didn't deserve it.
> > > This can result in tasks getting more run time than they should
> > > actually
> > > get.
> > > 
> > > So, we instead don't make future updates pay back past excess stolen
> > > time.
> > > 
> > > Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> > > ---
> > > v2:
> > > - Slightly changed to simply moving one line up instead of adding
> > >   new variable.
> > > 
> > > v1:
> > > https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com
> > > ---
> > >  kernel/sched/core.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index f3951e4a55e5..6c34de8b3fbb 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -730,11 +730,11 @@ static void update_rq_clock_task(struct rq *rq,
> > > s64 delta)
> > >         if (static_key_false((&paravirt_steal_rq_enabled))) {
> > >                 steal = paravirt_steal_clock(cpu_of(rq));
> > >                 steal -= rq->prev_steal_time_rq;
> > > +               rq->prev_steal_time_rq += steal;
> > 
> > The above two lines are essentially:
> > 
> >         steal -= prev;
> >         prev += steal;
> > 
> > It's like one of those clever ways of exchanging two variables with
> > three XOR operations. I don't like it :)
> > 
> > Ultimately, you're just setting rq->prev_steal_time_rq to the latest
> > value that you just read from paravirt_steal_clock(). And then setting
> > 'steal' to the delta between the new reading and the previous one.
> > 
> > The code above is *far* from obvious. At the very least it wants a
> > comment, but I'd rather see it refactored so that it doesn't need one. 
> > 
> >     u64 abs_steal_now = paravirt_steal_clock(cpu_of(rq));
> >     steal = abs_steal_now - rq->prev_steal_time_rq;
> >     rq->prev_steal_time_rq = abs_steal_now;
> 
> That is what v1 did:
> https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com/
> 
> It is also more obvious to me, but the feedback I received was that
> the way in the current iteration is better.
> 
> I don't feel strongly about it, and I'd be ok with either version applied. 

Fair enough. Not really a hill anyone should choose to die on, I
suppose.

> > 
> > I'm still not utterly convinced this is the right thing to do, though.
> > It means you will constantly undermeasure the accounting of steal time
> > as you discard the excess each time.
> > 
> > The underlying bug here is that we are sampling the steal time and the
> > time slice at *different* times. This update_rq_clock_task() function
> > could be called with a calculated 'delta' argument... and then
> > experience a large amount of steal time *before* calling
> > paravirt_steal_clock(), which is how we end up in the situation where
> > the calculated steal time exceeds the running time of the previous
> > task.
> > 
> > Which task *should* that steal time be accounted to? At the moment it
> > ends up being accounted to the next task to run — which seems to make
> > sense to me. In the situation I just described, we can consider the
> > time stolen in update_rq_clock_task() itself to have been stolen from
> > the *incoming* task, not the *outgoing* one. But that seems to be what
> > you're objecting to?
> 
> This is a good description of the problem, except that the time stolen
> in update_rq_clock_task() itself is actually being stolen from the 
> outgoing task. This is because we are still trying to calculate how long
> it ran for (update_curr()), and time hasn't started ticking for the
> incoming task yet. We haven't set the incoming task's exec_start with the
> new clock_task time yet.
> 
> So, in my opinion, it's wrong to give that time to the incoming task.

That makes sense. That steal time is actually stolen from *neither*
task, since it's after the 'end' timestamp of the outgoing task, and
before the 'start' timestamp of the incoming task.

So where *should* it be accounted?

Or is it actually correct to drop it completely?

If you can make a coherent case for the fact that dropping it is really
the right thing to do (not *just* that it doesn't belong to the
outgoing task, which is the case you make in your existing commit
message), then I suppose I'm OK with your patch as-is.
>

Re: [PATCH v2] sched: Don't try to catch up excess steal time.

Posted by Suleiman Souhlal 1 year, 4 months ago

On Wed, Sep 25, 2024 at 03:26:56PM +0100, David Woodhouse wrote:
> On Wed, 2024-09-25 at 13:25 +0000, Suleiman Souhlal wrote:
> > > 
> > > I'm still not utterly convinced this is the right thing to do, though.
> > > It means you will constantly undermeasure the accounting of steal time
> > > as you discard the excess each time.
> > > 
> > > The underlying bug here is that we are sampling the steal time and the
> > > time slice at *different* times. This update_rq_clock_task() function
> > > could be called with a calculated 'delta' argument... and then
> > > experience a large amount of steal time *before* calling
> > > paravirt_steal_clock(), which is how we end up in the situation where
> > > the calculated steal time exceeds the running time of the previous
> > > task.
> > > 
> > > Which task *should* that steal time be accounted to? At the moment it
> > > ends up being accounted to the next task to run — which seems to make
> > > sense to me. In the situation I just described, we can consider the
> > > time stolen in update_rq_clock_task() itself to have been stolen from
> > > the *incoming* task, not the *outgoing* one. But that seems to be what
> > > you're objecting to?
> > 
> > This is a good description of the problem, except that the time stolen
> > in update_rq_clock_task() itself is actually being stolen from the 
> > outgoing task. This is because we are still trying to calculate how long
> > it ran for (update_curr()), and time hasn't started ticking for the
> > incoming task yet. We haven't set the incoming task's exec_start with the
> > new clock_task time yet.
> > 
> > So, in my opinion, it's wrong to give that time to the incoming task.
> 
> That makes sense. That steal time is actually stolen from *neither*
> task, since it's after the 'end' timestamp of the outgoing task, and
> before the 'start' timestamp of the incoming task.
> 
> So where *should* it be accounted?
> 
> Or is it actually correct to drop it completely?
> 
> If you can make a coherent case for the fact that dropping it is really
> the right thing to do (not *just* that it doesn't belong to the
> outgoing task, which is the case you make in your existing commit
> message), then I suppose I'm OK with your patch as-is.
> > 

Yes, that's a good way to put it: The excess steal time isn't actually
being stolen from anyone.
And since it's not being stolen from anyone, isn't the right thing to do
to drop it?

There might still be extra steal time that doesn't exceed the current
'delta' from the race between reading the two values, that would still
be erroneously accounted to the outgoing task, which this patch doesn't
address, but we know that any steal > delta is from this race and should
be dropped.

-- Suleiman

Re: [PATCH v2] sched: Don't try to catch up excess steal time.

Posted by David Woodhouse 1 year, 4 months ago

On Wed, 2024-09-25 at 15:15 +0000, Suleiman Souhlal wrote:
> Yes, that's a good way to put it: The excess steal time isn't actually
> being stolen from anyone.
> And since it's not being stolen from anyone, isn't the right thing to do
> to drop it?

It's being stolen from the system, isn't it? Just not any specific
userspace process?

If we have separate "end of outgoing task" and "start of incoming task"
timestamps, surely the time between those two must be accounted
*somewhere*?

> There might still be extra steal time that doesn't exceed the current
> 'delta' from the race between reading the two values, that would still
> be erroneously accounted to the outgoing task, which this patch doesn't
> address, but we know that any steal > delta is from this race and should
> be dropped.

Well that's what we want the atomic paired read for :)

Re: [PATCH v2] sched: Don't try to catch up excess steal time.

Posted by Suleiman Souhlal 1 year, 4 months ago

On Wed, Sep 25, 2024 at 04:34:57PM +0100, David Woodhouse wrote:
> On Wed, 2024-09-25 at 15:15 +0000, Suleiman Souhlal wrote:
> > Yes, that's a good way to put it: The excess steal time isn't actually
> > being stolen from anyone.
> > And since it's not being stolen from anyone, isn't the right thing to do
> > to drop it?
> 
> It's being stolen from the system, isn't it? Just not any specific
> userspace process?

I guess it depends what you mean by "stolen". I would argue that it's not
stolen from anyone since the time isn't actually counted anywhere.

> 
> If we have separate "end of outgoing task" and "start of incoming task"
> timestamps, surely the time between those two must be accounted
> *somewhere*?

Not exactly. clock_task is essentially frozen until the next
update_rq_clock(), at which point we'll look at how much sched_clock_cpu
advanced and subtract how much steal time advanced. The two things
are done in separate spots (update_rq_clock() and update_rq_clock_task()),
indepedently (which is where the race is happening).
As far as I can tell, the time between the two isn't really accounted
anywhere.

The "end of outgoing task" and "start of incoming task" timestamps should
end up being the same.

> 
> > There might still be extra steal time that doesn't exceed the current
> > 'delta' from the race between reading the two values, that would still
> > be erroneously accounted to the outgoing task, which this patch doesn't
> > address, but we know that any steal > delta is from this race and should
> > be dropped.
> 
> Well that's what we want the atomic paired read for :)

Right, but I don't think it's that simple. We aren't only reading memory
but also a clock.
It might be possible to address this with a mechanism like rseq, but that
would be a much bigger patch set than the current one (and I don't think
anyone has ever attempted to do rseq for VMs yet).

(There is also another potential issue I spotted with steal time, that
has to do with reading another VCPU's steal time while it's not running,
but I'll start a separate discussion about that with a different patch set.)

Thanks for the discussion.
-- Suleiman

Re: [PATCH v2] sched: Don't try to catch up excess steal time.

Posted by Suleiman Souhlal 1 year, 4 months ago

On Wed, Sep 11, 2024 at 8:16 PM Suleiman Souhlal <suleiman@google.com> wrote:
>
> When steal time exceeds the measured delta when updating clock_task, we
> currently try to catch up the excess in future updates.
> However, this results in inaccurate run times for the future things using
> clock_task, as they end up getting additional steal time that did not
> actually happen.
>
> For example, suppose a task in a VM runs for 10ms and had 15ms of steal
> time reported while it ran. clock_task rightly doesn't advance. Then, a
> different taks runs on the same rq for 10ms without any time stolen in
> the host.
> Because of the current catch up mechanism, clock_sched inaccurately ends
> up advancing by only 5ms instead of 10ms even though there wasn't any
> actual time stolen. The second task is getting charged for less time
> than it ran, even though it didn't deserve it.
> This can result in tasks getting more run time than they should actually
> get.
>
> So, we instead don't make future updates pay back past excess stolen time.
>
> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> ---
> v2:
> - Slightly changed to simply moving one line up instead of adding
>   new variable.
>
> v1: https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com
> ---
>  kernel/sched/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Friendly ping.

Thanks,
-- Suleiman

Re: [PATCH v2] sched: Don't try to catch up excess steal time.

Posted by Srikar Dronamraju 1 year, 4 months ago

* Suleiman Souhlal <suleiman@google.com> [2024-09-11 20:15:22]:

> When steal time exceeds the measured delta when updating clock_task, we
> currently try to catch up the excess in future updates.
> However, this results in inaccurate run times for the future things using
> clock_task, as they end up getting additional steal time that did not
> actually happen.
> 
> For example, suppose a task in a VM runs for 10ms and had 15ms of steal
> time reported while it ran. clock_task rightly doesn't advance. Then, a
> different taks runs on the same rq for 10ms without any time stolen in
> the host.
> Because of the current catch up mechanism, clock_sched inaccurately ends
> up advancing by only 5ms instead of 10ms even though there wasn't any
> actual time stolen. The second task is getting charged for less time
> than it ran, even though it didn't deserve it.
> This can result in tasks getting more run time than they should actually
> get.
> 
> So, we instead don't make future updates pay back past excess stolen time.
> 
> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> ---
> v2:
> - Slightly changed to simply moving one line up instead of adding
>   new variable.
> 
> v1: https://lore.kernel.org/lkml/20240806111157.1336532-1-suleiman@google.com
> ---

After moving the line up, this looks good to me.

Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com>

-- 
Thanks and Regards
Srikar Dronamraju