[Xen-devel] [RFC 0/6] XEN scheduling hardening
Posted by Andrii Anisov 4 years, 9 months ago
From: Andrii Anisov <andrii_anisov@epam.com>

This is a very early RFC series aimed at addressing some of the VCPU time
accounting problems which affect scheduling fairness and accuracy. Please
note that this is done for ARM64 only so far.

One of the scheduling problems is a misleading CPU idle time concept.
Currently, the idle vcpu's run time is taken as the CPU idle time. But the
idle vcpu's run time includes IRQ processing, softirq processing, tasklet
processing, etc. Those tasks are not actually idle, and accounting them as
idle may mislead CPU frequency governors which rely on the CPU idle time.
In this series, it is suggested to take the time the CPU actually spends
in a low power mode as the idle time.
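
A minimal sketch of the idea (with hypothetical names, not the code from
the patches): only the interval actually spent in the low power wait is
counted as idle, instead of the whole idle vcpu run time.

#include <xen/percpu.h>
#include <xen/time.h>

/* Per-pCPU counter of time truly spent in a low power state. */
static DEFINE_PER_CPU(uint64_t, true_idle_ns);

static void do_true_idle(void)
{
    s_time_t start = NOW();           /* stamp before entering low power */

    arch_cpu_low_power_wait();        /* hypothetical; e.g. WFI on ARM64 */

    /* Charge only the time we actually slept, not the IRQ/softirq/tasklet
     * processing done elsewhere in the idle vcpu loop. */
    this_cpu(true_idle_ns) += NOW() - start;
}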

The other problem is that the execution time of pure hypervisor tasks is
charged to the guest vcpu budget. For example, IRQ and softirq processing
time is charged to the current vcpu's budget, which is most likely a guest
vcpu. This is quite unfair and may break scheduling reliability. It is
proposed to charge guest vcpus only for the guest's actual run time plus
the time spent serving the guest's hypercalls and accesses to emulated
iomem. All the rest (IRQ and softirq processing, branch prediction
hardening, etc.) is accounted as hypervisor run time.
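
A similarly hedged sketch of the proposed split (again, the names and the
vcpu field are illustrative only): stamp the boundary every time the CPU
switches between work done on behalf of the guest and pure hypervisor
work, and charge the elapsed slice to the matching bucket.

#include <xen/percpu.h>
#include <xen/sched.h>
#include <xen/time.h>

static DEFINE_PER_CPU(s_time_t, slice_start);  /* start of current slice */
static DEFINE_PER_CPU(uint64_t, hyp_ns);       /* hypervisor-only time   */

/* The slice that just ended was guest work (guest execution, a hypercall,
 * an emulated iomem access): charge it to the current vcpu. */
static void account_guest_slice(struct vcpu *v)
{
    s_time_t now = NOW();

    v->guest_ns += now - this_cpu(slice_start);   /* hypothetical field */
    this_cpu(slice_start) = now;
}

/* The slice that just ended was pure hypervisor work (IRQ/softirq
 * processing, mitigations, ...): charge it to the hypervisor bucket. */
static void account_hyp_slice(void)
{
    s_time_t now = NOW();

    this_cpu(hyp_ns) += now - this_cpu(slice_start);
    this_cpu(slice_start) = now;
}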

As the series is an early RFC, several points are still untouched:
 - With these changes, the time elapsed since the last rescheduling is no
   longer fully charged to the current vcpu's budget. Are any changes
   needed in the existing scheduling algorithms to cope with that?
 - How do we avoid the absolute top priority of tasklets (which is obeyed
   by all schedulers so far)? Should the idle vcpu be scheduled like the
   normal guest vcpus (through queues, priorities, etc.)?
 - The idle vcpu naming is quite misleading. It is a kind of system
   (hypervisor) task which is responsible for some hypervisor work. Should
   it be renamed/reconsidered?

Andrii Anisov (5):
  schedule: account true system idle time
  sysctl: extend XEN_SYSCTL_getcpuinfo interface
  xentop: show CPU load information
  arm64: call enter_hypervisor_head only when it is needed
  schedule: account all the hypervisor time to the idle vcpu

Julien Grall (1):
  xen/arm: Re-enable interrupt later in the trap path

 tools/xenstat/libxenstat/src/xenstat.c      |  38 +++++++++
 tools/xenstat/libxenstat/src/xenstat.h      |   9 ++
 tools/xenstat/libxenstat/src/xenstat_priv.h |   3 +
 tools/xenstat/xentop/xentop.c               |  30 +++++++
 xen/arch/arm/arm64/entry.S                  |  17 ++--
 xen/arch/arm/domain.c                       |  24 ++++++
 xen/arch/arm/traps.c                        | 128 +++++++++++++++++++---------
 xen/common/sched_credit.c                   |   2 +-
 xen/common/sched_credit2.c                  |   4 +-
 xen/common/sched_rt.c                       |   2 +-
 xen/common/schedule.c                       |  98 ++++++++++++++++++---
 xen/common/sysctl.c                         |   2 +
 xen/include/public/sysctl.h                 |   2 +
 xen/include/xen/sched.h                     |   7 ++
 14 files changed, 303 insertions(+), 63 deletions(-)

-- 
2.7.4


Re: [Xen-devel] [RFC 0/6] XEN scheduling hardening
Posted by Dario Faggioli 4 years, 9 months ago
[Adding George plus other x86, ARM and core-Xen people]

Hi Andrii,

First of all, thanks a lot for this series!

The problem you mention is a long standing one, and I'm glad we're
eventually starting to properly look into it.

I already have one comment: I think I can see where this comes from,
but I don't think 'XEN scheduling hardening' is what we're doing
in this series... I'd go for something like "xen: sched: improve idle
and vcpu time accounting precision", or something like that.

On Fri, 2019-07-26 at 13:37 +0300, Andrii Anisov wrote:
> One of the scheduling problems is a misleading CPU idle time concept.
> Now
> for the CPU idle time, it is taken an idle vcpu run time. But idle
> vcpu run
> time includes IRQ processing, softirqs processing, tasklets
> processing, etc.
> Those tasks are not actual idle and they accounting may mislead CPU
> freq
> governors who rely on the CPU idle time. 
>
Indeed! And I agree this is quite bad.

> The other problem is that pure hypervisor tasks execution time is
> charged from
> the guest vcpu budget. 
>
Yep, equally bad.

> For example, IRQ and softirq processing time are charged
> from the current vcpu budget, which is likely the guest vcpu. This is
> quite
> unfair and may break scheduling reliability. 
> It is proposed to charge guest
> vcpus for the guest actual run time and time to serve guest's
> hypercalls and
> access to emulated iomem. All the rest is calculated as the
> hypervisor run time
> (IRQ and softirq processing, branch prediction hardening, etc.)
> 
Right.

> While the series is the early RFC, several points are still
> untouched:
>  - Now the time elapsed from the last rescheduling is not fully
> charged from
>    the current vcpu budget. Are there any changes needed in the
> existing
>    scheduling algorithms?
>
I'll think about it, but off the top of my head, I don't see how
this can be a problem. Scheduling algorithms (should!) base their logic
and their calculations on actual vcpus' runtime, not much on idle
vcpus' one.

>  - How to avoid the absolute top priority of tasklets (what is obeyed
> by all
>    schedulers so far). Should idle vcpu be scheduled as the normal
> guest vcpus
>    (through queues, priorities, etc)?
>
Now, this is something to think about, and try to understand if
anything would break if we go for it. I mean, I see why you'd want to
do that, but tasklets and softirqs have worked the way they do, in Xen,
since they were introduced, I believe.

Therefore, even if there wouldn't be any subsystem explicitly relying
on the current behavior (which should be verified), I think we are at
high risk of breaking things, if we change.

That's not to say it would not be a good change, or that it is
impossible... It's, rather, just to raise some awareness. :-)

>  - Idle vcpu naming is quite misleading. It is a kind of system
> (hypervisor)
>    task which is responsible for some hypervisor work. Should it be
>    renamed/reconsidered?
> 
Well, that's a design question, even for this very series, isn't it? I
mean, I see two ways of achieving proper idle time accounting:
1) you leave things as they are --i.e., idle does not only do idling, 
   it also does all these other things, but you make sure you don't 
   count the time they take as idle time;
2) you move all these activities out of idle, and in some other 
   context, and you let idle just do the idling. At that point, time 
   accounted to idle will be only actual idle time, as the time it 
   took to Xen to do all the other things is now accounted to the new 
   execution context which is running them.

So, which path does this patch series take (I believe 1), and which path
do you (and others) believe is better?

(And, yes, discussing this is why I've added, apart from George, some
other x86, ARM, and core-Xen people)

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

Re: [Xen-devel] [RFC 0/6] XEN scheduling hardening
Posted by Juergen Gross 4 years, 9 months ago
On 26.07.19 13:56, Dario Faggioli wrote:
> [Adding George plus others x86, ARM and core-Xen people]
> 
> Hi Andrii,
> 
> First of all, thanks a lot for this series!
> 
> The problem you mention is a long standing one, and I'm glad we're
> eventually starting to properly look into it.
> 
> I already have one comment: I think I can see from where this come
> from, but I don't think 'XEN scheduling hardening' is what we're doing
> in this series... I'd go for something like "xen: sched: improve idle
> and vcpu time accounting precision", or something like that.
> 
> On Fri, 2019-07-26 at 13:37 +0300, Andrii Anisov wrote:
>> One of the scheduling problems is a misleading CPU idle time concept.
>> Now
>> for the CPU idle time, it is taken an idle vcpu run time. But idle
>> vcpu run
>> time includes IRQ processing, softirqs processing, tasklets
>> processing, etc.
>> Those tasks are not actual idle and they accounting may mislead CPU
>> freq
>> governors who rely on the CPU idle time.
>>
> Indeed! And I agree this is quite bad.
> 
>> The other problem is that pure hypervisor tasks execution time is
>> charged from
>> the guest vcpu budget.
>>
> Yep, equally bad.
> 
>> For example, IRQ and softirq processing time are charged
>> from the current vcpu budget, which is likely the guest vcpu. This is
>> quite
>> unfair and may break scheduling reliability.
>> It is proposed to charge guest
>> vcpus for the guest actual run time and time to serve guest's
>> hypercalls and
>> access to emulated iomem. All the rest is calculated as the
>> hypervisor run time
>> (IRQ and softirq processing, branch prediction hardening, etc.)
>>
> Right.
> 
>> While the series is the early RFC, several points are still
>> untouched:
>>   - Now the time elapsed from the last rescheduling is not fully
>> charged from
>>     the current vcpu budget. Are there any changes needed in the
>> existing
>>     scheduling algorithms?
>>
> I'll think about it, but out of the top of my head, I don't see how
> this can be a problem. Scheduling algorithms (should!) base their logic
> and their calculations on actual vcpus' runtime, not much on idle
> vcpus' one.
> 
>>   - How to avoid the absolute top priority of tasklets (what is obeyed
>> by all
>>     schedulers so far). Should idle vcpu be scheduled as the normal
>> guest vcpus
>>     (through queues, priorities, etc)?
>>
> Now, this is something to think about, and try to understand if
> anything would break if we go for it. I mean, I see why you'd want to
> do that, but tasklets and softirqs works the way they do, in Xen, since
> when they were introduced, I believe.
> 
> Therefore, even if there wouldn't be any subsystem explicitly relying
> on the current behavior (which should be verified), I think we are at
> high risk of breaking things, if we change.

We'd break things IMO.

Tasklets are sometimes used to perform async actions which can't be done
in guest vcpu context, like switching a domain to shadow mode for the L1TF
mitigation, or marshalling all cpus for stop_machine(). You don't want
tasklets to be blockable; you want them to run as soon as possible.
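
For readers not familiar with that code, a rough illustration of the
pattern (hypothetical names, not the actual Xen scheduler code): each
scheduler's decision hook is told whether tasklet work is pending and, if
so, hands the pcpu straight to the idle vcpu so do_tasklet() can run,
which is what gives tasklets their absolute top priority.

#include <xen/sched.h>    /* struct vcpu, s_time_t */

/* Hypothetical helpers standing in for per-scheduler internals. */
struct vcpu *this_cpu_idle_vcpu(void);
struct vcpu *pick_best_guest_vcpu(s_time_t now);
s_time_t default_timeslice(void);

struct sched_decision {
    struct vcpu *task;    /* vcpu to run next */
    s_time_t     time;    /* time slice, or -1 for "no limit" */
};

static struct sched_decision
example_do_schedule(s_time_t now, bool tasklet_work_scheduled)
{
    struct sched_decision ret;

    if ( tasklet_work_scheduled )
    {
        /* Pending tasklet work: pick the idle vcpu at once, regardless of
         * any runnable guest vcpu, so its loop can call do_tasklet(). */
        ret.task = this_cpu_idle_vcpu();
        ret.time = -1;
        return ret;
    }

    ret.task = pick_best_guest_vcpu(now);   /* normal scheduling decision */
    ret.time = default_timeslice();
    return ret;
}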

> 
> That's not to mean it would not be a good change, or that it is
> impossible... It's, rather, just to raise some awareness. :-)
> 
>>   - Idle vcpu naming is quite misleading. It is a kind of system
>> (hypervisor)
>>     task which is responsible for some hypervisor work. Should it be
>>     renamed/reconsidered?
>>
> Well, that's a design question, even for this very series, isn't it? I
> mean, I see two ways of achieving proper idle time accounting:
> 1) you leave things as they are --i.e., idle does not only do idling,
>     it also does all these other things, but you make sure you don't
>     count the time they take as idle time;
> 2) you move all these activities out of idle, and in some other
>     context, and you let idle just do the idling. At that point, time
>     accounted to idle will be only actual idle time, as the time it
>     took to Xen to do all the other things is now accounted to the new
>     execution context which is running them.

And here we are coming back to the idea of a "hypervisor domain" I
suggested about 10 years ago and which was rejected at that time...


Juergen


Re: [Xen-devel] [RFC 0/6] XEN scheduling hardening
Posted by Andrii Anisov 4 years, 9 months ago
Hello Juergen,

On 26.07.19 15:14, Juergen Gross wrote:
>>>   - How to avoid the absolute top priority of tasklets (what is obeyed
>>> by all
>>>     schedulers so far). Should idle vcpu be scheduled as the normal
>>> guest vcpus
>>>     (through queues, priorities, etc)?
>>>
>> Now, this is something to think about, and try to understand if
>> anything would break if we go for it. I mean, I see why you'd want to
>> do that, but tasklets and softirqs works the way they do, in Xen, since
>> when they were introduced, I believe.
>>
>> Therefore, even if there wouldn't be any subsystem explicitly relying
>> on the current behavior (which should be verified), I think we are at
>> high risk of breaking things, if we change.
> 
> We'd break things IMO.
> 
> Tasklets are sometimes used to perform async actions which can't be done
> in guest vcpu context.

OK, that stuff cannot be done in guest vcpu context. But why can't it be done at the guest's priority?

> Like switching a domain to shadow mode for L1TF
> mitigation,

Sorry, I'm not really aware of what the L1TF mitigation is.
But

> or marshalling all cpus for stop_machine().

I think I faced that some time ago, when I fixed a noticeable lag in video playback at the moment another domain rebooted.

> You don't want
> to be able to block tasklets, you want them to run as soon as possible.

Should it really be done in a way that breaks scheduling? What if the system has RT requirements?

-- 
Sincerely,
Andrii Anisov.

Re: [Xen-devel] [RFC 0/6] XEN scheduling hardening
Posted by Dario Faggioli 4 years, 9 months ago
On Mon, 2019-07-29 at 17:47 +0300, Andrii Anisov wrote:
> On 26.07.19 15:14, Juergen Gross wrote:
> > > > 
> > Tasklets are sometimes used to perform async actions which can't be
> > done
> > in guest vcpu context.
> > [...]
> > Like switching a domain to shadow mode for L1TF
> > mitigation,
> 
> Sorry I'm not really aware what L1TF mitigation is.
> But
> 
> > or marshalling all cpus for stop_machine().
> 
> I think I faced some time ago. When fixed a noticeable lag in video
> playback at the moment of the other domain reboot.
> 
No, stop_machine() is not about a VM being shut down/killed/stopped.
It's, let's say, about all the (physical) CPUs in the host having to do
something, in a coordinated way.

Check the comment in xen/include/xen/stop_machine.h
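
For reference, a hedged example of what using it looks like from the
caller's side (the callback is made up; the prototype is how I recall it
from xen/include/xen/stop_machine.h):

#include <xen/stop_machine.h>

/* Hypothetical callback: work that needs system-wide quiescence (e.g.
 * taking a CPU down), run while every other pCPU spins with IRQs off. */
static int example_action(void *data)
{
    return 0;
}

static int run_it_on(unsigned int cpu)
{
    /*
     * Marshal all online pCPUs into a rendezvous and run example_action()
     * on 'cpu' while the others wait; internally this relies on tasklets
     * being run with top priority on every pCPU.
     */
    return stop_machine_run(example_action, NULL, cpu);
}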

> > You don't want
> > to be able to block tasklets, you want them to run as soon as
> > possible.
> 
> Should it really be done in the way of breaking scheduling? What if
> the system has RT requirements?
> 
It's possible, I guess, that we can implement some of the things that
are done in tasklets (either stop_machine() or something else)
differently, and in a way that is less disruptive for scheduling and
determinism.

But, if it is, it's not going to be as easy as <<let's run tasklets at
domain priority, and be done with it>>, I'm afraid. :-(

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

Re: [Xen-devel] [RFC 0/6] XEN scheduling hardening
Posted by Dario Faggioli 4 years, 9 months ago
On Fri, 2019-07-26 at 14:14 +0200, Juergen Gross wrote:
> On 26.07.19 13:56, Dario Faggioli wrote:
> > On Fri, 2019-07-26 at 13:37 +0300, Andrii Anisov wrote:
> > >   - How to avoid the absolute top priority of tasklets (what is
> > > obeyed
> > > by all
> > >     schedulers so far). Should idle vcpu be scheduled as the
> > > normal
> > > guest vcpus
> > >     (through queues, priorities, etc)?
> > > 
> > Therefore, even if there wouldn't be any subsystem explicitly
> > relying
> > on the current behavior (which should be verified), I think we are
> > at
> > high risk of breaking things, if we change.
> 
> We'd break things IMO.
> 
> Tasklets are sometimes used to perform async actions which can't be
> done
> in guest vcpu context. Like switching a domain to shadow mode for
> L1TF
> mitigation, or marshalling all cpus for stop_machine(). You don't
> want
> to be able to block tasklets, you want them to run as soon as
> possible.
> 
Yep, stop-machine was precisely what I had in mind, but as Juergen
says, there's more.

As said, I suggest we defer this problem or, in general, we treat it
outside of this series.

> > 2) you move all these activities out of idle, and in some other
> >     context, and you let idle just do the idling. At that point,
> > time
> >     accounted to idle will be only actual idle time, as the time it
> >     took to Xen to do all the other things is now accounted to the
> > new
> >     execution context which is running them.
> 
> And here we are coming back to the idea of a "hypervisor domain" I
> suggested about 10 years ago and which was rejected at that time...
> 
It's pretty much what Andrii is proposing already, when he says he'd
consider idle_vcpu a 'hypervisor vcpu'. Or at least a natural
extension of that.

I don't know what the occasion was for proposing it, or the arguments
against it, 10 years ago, so I won't comment on that. :-D

Let's see if something like that ends up making sense for this work. I'm
unconvinced, for now, but I'm still looking at and thinking about the
code. :-)

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

Re: [Xen-devel] [RFC 0/6] XEN scheduling hardening
Posted by Juergen Gross 4 years, 9 months ago
On 29.07.19 13:53, Dario Faggioli wrote:
> On Fri, 2019-07-26 at 14:14 +0200, Juergen Gross wrote:
>> On 26.07.19 13:56, Dario Faggioli wrote:
>>> On Fri, 2019-07-26 at 13:37 +0300, Andrii Anisov wrote:
>>>>    - How to avoid the absolute top priority of tasklets (what is
>>>> obeyed
>>>> by all
>>>>      schedulers so far). Should idle vcpu be scheduled as the
>>>> normal
>>>> guest vcpus
>>>>      (through queues, priorities, etc)?
>>>>
>>> Therefore, even if there wouldn't be any subsystem explicitly
>>> relying
>>> on the current behavior (which should be verified), I think we are
>>> at
>>> high risk of breaking things, if we change.
>>
>> We'd break things IMO.
>>
>> Tasklets are sometimes used to perform async actions which can't be
>> done
>> in guest vcpu context. Like switching a domain to shadow mode for
>> L1TF
>> mitigation, or marshalling all cpus for stop_machine(). You don't
>> want
>> to be able to block tasklets, you want them to run as soon as
>> possible.
>>
> Yep, stop-machine was precisely what I had in mind, but as Juergen
> says, there's more.
> 
> As said, I suggest we defer this problem or, in general, we treat it
> outside of this series.
> 
>>> 2) you move all these activities out of idle, and in some other
>>>      context, and you let idle just do the idling. At that point,
>>> time
>>>      accounted to idle will be only actual idle time, as the time it
>>>      took to Xen to do all the other things is now accounted to the
>>> new
>>>      execution context which is running them.
>>
>> And here we are coming back to the idea of a "hypervisor domain" I
>> suggested about 10 years ago and which was rejected at that time...
>>
> It's pretty much what Andrii is proposing already, when he says he'd
> consider idle_vcpu an 'hypervisor vcpu'. Or at least a naturale
> extension of that.

The main difference is its priority and the possibility of allowing it to
block.

> I don't know what was the occasion for proposing it, and the argument
> against it, 10 years ago, so I won't comment on that. :-D

It was in the discussion of my initial submission of cpupools. It was
one alternative considered for solving the continue_hypercall_on_cpu()
problem. In the end that was done via tasklets. :-)


Juergen


Re: [Xen-devel] [RFC 0/6] XEN scheduling hardening
Posted by Andrii Anisov 4 years, 9 months ago
Hello Dario

On 26.07.19 14:56, Dario Faggioli wrote:
> [Adding George plus others x86, ARM and core-Xen people]
> 
> Hi Andrii,
> 
> First of all, thanks a lot for this series!
> 
> The problem you mention is a long standing one, and I'm glad we're
> eventually starting to properly look into it.
> 
> I already have one comment: I think I can see from where this come
> from, but I don't think 'XEN scheduling hardening' is what we're doing
> in this series... I'd go for something like "xen: sched: improve idle
> and vcpu time accounting precision", or something like that.

I do not really insist on the naming. I will rename it in the next version.

>> While the series is the early RFC, several points are still
>> untouched:
>>   - Now the time elapsed from the last rescheduling is not fully
>> charged from
>>     the current vcpu budget. Are there any changes needed in the
>> existing
>>     scheduling algorithms?
>>
> I'll think about it, but out of the top of my head, I don't see how
> this can be a problem. Scheduling algorithms (should!) base their logic
> and their calculations on actual vcpus' runtime, not much on idle
> vcpus' one.

IMO the RTDS and ARINC653 scheduling algorithms are not affected, because they operate on the absolute amount of time spent by a vcpu and a future event (nearest deadline or major frame end).
But I have my doubts about the credit schedulers (credit, credit2). Now we have an entity which unconditionally steals time from some periods. Wouldn't that affect the calculation of the domains' budget proportions with respect to the domains' weight/cap?


>>   - How to avoid the absolute top priority of tasklets (what is obeyed
>> by all
>>     schedulers so far). Should idle vcpu be scheduled as the normal
>> guest vcpus
>>     (through queues, priorities, etc)?
>>
> Now, this is something to think about, and try to understand if
> anything would break if we go for it. I mean, I see why you'd want to
> do that, but tasklets and softirqs works the way they do, in Xen, since
> when they were introduced, I believe.
> 
> Therefore, even if there wouldn't be any subsystem explicitly relying
> on the current behavior (which should be verified), I think we are at
> high risk of breaking things, if we change.
> 
> That's not to mean it would not be a good change, or that it is
> impossible... It's, rather, just to raise some awareness. :-)

I understand that this area is conservative and hard to change.
But the current scheduling in XEN is quite non-deterministic. And, IMO, with that mess XEN cannot go into any safety-certified system.

>>   - Idle vcpu naming is quite misleading. It is a kind of system
>> (hypervisor)
>>     task which is responsible for some hypervisor work. Should it be
>>     renamed/reconsidered?
>>
> Well, that's a design question, even for this very series, isn't it? I
> mean, I see two ways of achieving proper idle time accounting:
> 1) you leave things as they are --i.e., idle does not only do idling,
>     it also does all these other things, but you make sure you don't
>     count the time they take as idle time;
> 2) you move all these activities out of idle, and in some other
>     context, and you let idle just do the idling. At that point, time
>     accounted to idle will be only actual idle time, as the time it
>     took to Xen to do all the other things is now accounted to the new
>     execution context which is running them.
> 
> So, which path this path series takes (I believe 1), and which path you
> (and others) believe is better?

This has to be discussed.
I would stress again that this is a set of minimal changes following the existing approaches (e.g. I don't like the runstate usage here).

> (And, yes, discussing this is why I've added, apart from George, some
> other x86, ARM, and core-Xen people)

Thank you.

-- 
Sincerely,
Andrii Anisov.
