[PATCH RFC 00/10] qemu: Enable SCHED_CORE for domains and helper processes

Michal Privoznik posted 10 patches 1 year, 11 months ago
git fetch https://github.com/patchew-project/libvirt tags/patchew/cover.1652106787.git.mprivozn@redhat.com
There is a newer version of this series
[PATCH RFC 00/10] qemu: Enable SCHED_CORE for domains and helper processes
Posted by Michal Privoznik 1 year, 11 months ago
The Linux kernel offers a way to mitigate side channel attacks on Hyper
Threads (e.g. MDS and L1TF). Long story short, userspace can define
groups of processes (aka trusted groups) and only processes within one
group can run on sibling Hyper Threads. The group membership is
automatically preserved on fork() and exec().

Now, there is one scenario which I don't cover in my series and I'd like
to hear proposals: if there are two guests with an odd number of vCPUs,
they can no longer run on sibling Hyper Threads because my patches create
a separate group for each QEMU. This is a performance penalty. Ideally,
we would have a knob inside the domain XML that would place two or more
domains into the same trusted group. But since there's no pre-existing
example (of sharing a piece of information between two domains), I've
failed to come up with something usable.

Also, it's worth noting that at the kernel level, group membership is
expressed by a so-called 'cookie', which is effectively a unique
unsigned long number, but there's no API that would "set this number on
a given process", so we may have to go with some abstraction layer.

Michal Prívozník (10):
  qemu_tpm: Make APIs work over a single virDomainTPMDef
  qemu_dbus: Separate PID read code into qemuDBusGetPID
  qemu_vhost_user_gpu: Export qemuVhostUserGPUGetPid()
  qemu_tpm: Expose qemuTPMEmulatorGetPid()
  qemu_virtiofs: Separate PID read code into qemuVirtioFSGetPid
  virprocess: Core Scheduling support
  virCommand: Introduce APIs for core scheduling
  qemu_conf: Introduce a knob to turn off SCHED_CORE
  qemu: Enable SCHED_CORE for domains and helper processes
  qemu: Place helper processes into the same trusted group

 src/libvirt_private.syms           |   6 +
 src/qemu/libvirtd_qemu.aug         |   1 +
 src/qemu/qemu.conf.in              |   5 +
 src/qemu/qemu_conf.c               |  24 ++++
 src/qemu/qemu_conf.h               |   2 +
 src/qemu/qemu_dbus.c               |  42 ++++---
 src/qemu/qemu_dbus.h               |   4 +
 src/qemu/qemu_extdevice.c          | 171 ++++++++++++++++++++++++++---
 src/qemu/qemu_extdevice.h          |   3 +
 src/qemu/qemu_process.c            |   9 ++
 src/qemu/qemu_security.c           |   4 +
 src/qemu/qemu_tpm.c                |  91 +++++----------
 src/qemu/qemu_tpm.h                |  18 ++-
 src/qemu/qemu_vhost_user_gpu.c     |   2 +-
 src/qemu/qemu_vhost_user_gpu.h     |   8 ++
 src/qemu/qemu_virtiofs.c           |  41 ++++---
 src/qemu/qemu_virtiofs.h           |   5 +
 src/qemu/test_libvirtd_qemu.aug.in |   1 +
 src/util/vircommand.c              |  74 +++++++++++++
 src/util/vircommand.h              |   5 +
 src/util/virprocess.c              | 124 +++++++++++++++++++++
 src/util/virprocess.h              |   8 ++
 22 files changed, 538 insertions(+), 110 deletions(-)

-- 
2.35.1

Re: [PATCH RFC 00/10] qemu: Enable SCHED_CORE for domains and helper processes
Posted by Daniel P. Berrangé 1 year, 11 months ago
On Mon, May 09, 2022 at 05:02:07PM +0200, Michal Privoznik wrote:
> The Linux kernel offers a way to mitigate side channel attacks on Hyper
> Threads (e.g. MDS and L1TF). Long story short, userspace can define
> groups of processes (aka trusted groups) and only processes within one
> group can run on sibling Hyper Threads. The group membership is
> automatically preserved on fork() and exec().
> 
> Now, there is one scenario which I don't cover in my series and I'd like
> to hear proposals: if there are two guests with an odd number of vCPUs,
> they can no longer run on sibling Hyper Threads because my patches create
> a separate group for each QEMU. This is a performance penalty. Ideally,
> we would have a knob inside the domain XML that would place two or more
> domains into the same trusted group. But since there's no pre-existing
> example (of sharing a piece of information between two domains), I've
> failed to come up with something usable.

Right now users have two choices

  - Run with SMT enabled. 100% of CPUs available. VMs are vulnerable
  - Run with SMT disabled. 50% of CPUs available. VMs are safe

What core scheduling gives is somewhere in between, depending on
the vCPU count. If we assume all guests have an even number of vCPUs then

  - Run with SMT enabled + core scheduling. 100% of CPUs available.
    100% of CPUs are used, VMs are safe

This is the ideal scenario, and probably a fairly common one too, as
IMHO even vCPU counts are likely to be typical.

If we assume the worst case, of entirely 1 vCPU guests then we have

  - Run with SMT enabled + core scheduling. 100% of CPUs available.
    50% of CPUs are used, VMs are safe

This feels highly unlikely though, as all except tiny workloads
want > 1 vCPU.

With entirely 3 vCPU guests then we have

  - Run with SMT enabled + core scheduling. 100% of CPUs available.
    75% of CPUs are used, VMs are safe

With entirely 5 vCPU guests then we have

  - Run with SMT enabled + core scheduling. 100% of CPUs available.
    83% of CPUs are used, VMs are safe

If we have a mix of even- and odd-numbered vCPU guests, with mostly
even-numbered, then I think utilization will be high enough that
almost no one will care about the last few %.
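
(As a quick sketch of where those percentages come from -- nothing from
the patches, just the arithmetic: a guest with N vCPUs in its own trusted
group effectively reserves whole cores, i.e. 2 * ceil(N/2) threads on a
2-way SMT host, so the used fraction is N / (2 * ceil(N/2)).)

  /* Sketch only: utilization with one trusted group per guest,
   * assuming 2-way SMT and guests with 'vcpus' vCPUs each (vcpus >= 1). */
  static unsigned int used_percent(unsigned int vcpus)
  {
      unsigned int reserved = 2 * ((vcpus + 1) / 2);  /* whole cores */

      return 100 * vcpus / reserved;  /* 1 -> 50, 3 -> 75, 5 -> 83, 4 -> 100 */
  }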

While we could try to come up with a way to express sharing of
cores between VMs, I don't think it's worth it in the absence of
someone presenting compelling data on why it'll be needed in a
non-niche use case. Bear in mind that users can also resort to
pinning VMs explicitly to get sharing.

In terms of defaults I'd very much like us to default to enabling
core scheduling, so that we have a secure deployment out of the box.
The only caveat is that this does have the potential to be interpreted
as a regression for existing deployments in some cases. Perhaps we
should make it a meson option for distros to decide whether to ship
with it turned on out of the box or not ?

I don't think we need core scheduling to be a VM XML config option,
because security is really a host-level matter IMHO, such that it
doesn't make sense to have both secure & insecure VMs co-located.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH RFC 00/10] qemu: Enable SCHED_CORE for domains and helper processes
Posted by Dario Faggioli 1 year, 11 months ago
On Mon, 2022-05-23 at 17:13 +0100, Daniel P. Berrangé wrote:
> On Mon, May 09, 2022 at 05:02:07PM +0200, Michal Privoznik wrote:
> In terms of defaults I'd very much like us to default to enabling
> core scheduling, so that we have a secure deployment out of the box.
> The only caveat is that this does have the potential to be
> interpreted
> as a regression for existing deployments in some cases. Perhaps we
> should make it a meson option for distros to decide whether to ship
> with it turned on out of the box or not ?
>
I think, as Michal said, with the qemu.conf knob from patch 8, we will
already have that. I.e., distros will ship a qemu.conf with sched_core
equal to 1 or 0, depending on what they want as a default behavior.
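
(Purely to illustrate -- the exact spelling is whatever patch 8 ends up
using, so treat this as a guess rather than the real option name:)

  # /etc/libvirt/qemu.conf -- hypothetical spelling of the patch 8 knob
  sched_core = 1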

> I don't think we need core scheduling to be a VM XML config option,
> because security is really a host-level matter IMHO, such that it
> doesn't make sense to have both secure & insecure VMs co-located.
> 
Mmm... I can't say that I have any concrete example, but I guess I can
picture a situation where someone has "sensitive" VMs, which they would
want to make sure are protected from the possibility that other VMs
steal their secrets, and "less sensitive" ones, for which it's not a
concern if they share cores and (potentially) steal secrets from each
other (as long as none of them can steal from any "sensitive" one,
which cannot happen if we set core scheduling for the latter).

Another scenario would be if core-scheduling is (ab)used for limiting
interference, like some sort of flexible and dynamic form of
vcpu-pinning. That is, if I set core-scheduling for VM1, I'm sure that
VM1's vcpus will never share cores with any other VMs. Which is good for
performance and determinism, because it means that it can't happen that
vcpu1 of VM3 runs on the same core as vcpu0 of VM1 and, when VM3-vcpu1
is busy, VM1-vcpu0 slows down as well. Imagine that VM1 and VM3 are
owned by different customers: core-scheduling would allow me to make
sure that whatever customer A is doing in VM3, it can't slow down
customer B, who owns VM1, without having to resort to vcpu-pinning,
which is inflexible. And again, maybe we do want this "dynamic
interference shielding" property for some VMs, but not for all... E.g.,
we can offer it as a higher SLA, and ask more money for a VM that has
it.

Thoughts?

In any case, even if we decide that we do want per-VM core-scheduling,
e.g., for the above-mentioned reasons, I guess it can come later, as a
further improvement (and I'd be happy to help make it happen).

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Re: [PATCH RFC 00/10] qemu: Enable SCHED_CORE for domains and helper processes
Posted by Michal Prívozník 1 year, 11 months ago
On 5/26/22 14:01, Dario Faggioli wrote:
> On Mon, 2022-05-23 at 17:13 +0100, Daniel P. Berrangé wrote:
>> On Mon, May 09, 2022 at 05:02:07PM +0200, Michal Privoznik wrote:
>> In terms of defaults I'd very much like us to default to enabling
>> core scheduling, so that we have a secure deployment out of the box.
>> The only caveat is that this does have the potential to be
>> interpreted
>> as a regression for existing deployments in some cases. Perhaps we
>> should make it a meson option for distros to decide whether to ship
>> with it turned on out of the box or not ?
>>
> I think, as Michal said, with the qemu.conf knob from patch 8, we will
> already have that. I.e., distros will ship a qemu.conf with sched_core
> equal to 1 or 0, depending on what they want as a default behavior.
> 
>> I don't think we need core scheduling to be a VM XML config option,
>> because security is really a host-level matter IMHO, such that it
>> doesn't make sense to have both secure & insecure VMs co-located.
>>
> Mmm... I can't say that I have any concrete example, but I guess I can
> picture a situation where someone has "sensitive" VMs, which he/she
> would want to make sure they're protected from the possibility that
> other VMs steal their secrets, and "less sensitive" ones, for which
> it's not a concern if they share cores and (potentially) steal secrets
> among each others (as far as none of those can steal from any
> "sensitive" one, but this does not happen, if we set core scheduling
> for the latter).
> 
> Another scenario would be if core-scheduling is (ab)used for limiting
> the interference. Like some sort of flexible and dynamic form of vcpu-
> pinning. That is, if I set core-scheduling for VM1, I'm sure that VM1's
> vcpus will never share cores with any other VMs. Which is good for
> performance and determinism, because it means that it can't happen that
> vcpu1 of VM3 runs on the same core of vcpu0 of VM1 and, when VM3-vcpu1
> is busy, VM1-vcpu0 slows down as well. Imagine that VM1 and VM3 are
> owned by different customers, core-scheduling would allow me to make
> sure that whatever customer A is doing in VM3, it can't slow down
> customer B, who owns VM1, without having to resort to vcpu-pinning,
> which is inflexible. And again, maybe we do want this "dynamic
> interference shielding" property for some VMs, but not for all... E.g.,
> we can offer it as a higher SLA, and ask more money for a VM that has
> it.
> 
> Thoughts?

I'd expect the host scheduler to work around this problem, e.g. it could
run vCPUs of different VMs on different cores. Of course, this assumes
that they are allowed to run on different cores (i.e. they are not
pinned onto the same physical CPU). And if they are, then that's
obviously a misconfiguration on the admin's side.

Michal

Re: [PATCH RFC 00/10] qemu: Enable SCHED_CORE for domains and helper processes
Posted by Dario Faggioli 1 year, 11 months ago
On Thu, 2022-05-26 at 14:01 +0200, Dario Faggioli wrote:
> Thoughts?
> 
Oh, and there are even a couple of other (potential) use cases for
having (even more!) fine-grained control of core-scheduling.

So, right now, giving a virtual topology to a VM pretty much only
makes sense if the VM has its vcpus pinned. Well, actually, there's
something that we can do even if that is not the case, especially if we
define at least *some* constraints on where the vcpus can run, even if
we don't have strict and static 1-to-1 pinning... But for sure we
shouldn't define an SMT topology if we don't have that (i.e., if we
don't have strict and static 1-to-1 pinning). And yet, the vcpus will
run on cores and threads!

Now, if we implement per-vcpu core-scheduling (which means being able
to put not necessarily whole VMs, but single vcpus [although, of the
same VM], in trusted groups), then we can:
- put vcpu0 and vcpu1 of VM1 in a group
- put vcpu2 and vcpu3 of VM1 in a(nother!) group
- define, in the virtual topology of VM1, vcpu0 and vcpu1 as
  SMT-threads of the same core
- define, in the virtual topology of VM1, vcpu2 and vcpu3 as
  SMT-threads of the same core
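
A hypothetical sketch of what the grouping steps could look like at the
prctl(2) level (thread scope instead of thread-group scope; the helper
and TIDs below are made up for illustration, this is not something the
current series implements):

  #include <sys/types.h>
  #include <sys/prctl.h>

  /* Hypothetical: put two vCPU threads (identified by TID) into one
   * trusted group. The calling process temporarily adopts the cookie
   * itself as a side effect of the pull/push dance. */
  static int group_two_vcpu_threads(pid_t vcpu0_tid, pid_t vcpu1_tid)
  {
      /* Give the first vCPU thread a fresh cookie of its own. */
      if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, vcpu0_tid,
                PR_SCHED_CORE_SCOPE_THREAD, 0) < 0)
          return -1;

      /* Pull that cookie onto the calling thread... */
      if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, vcpu0_tid,
                PR_SCHED_CORE_SCOPE_THREAD, 0) < 0)
          return -1;

      /* ...and push it to the second vCPU thread. */
      return prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, vcpu1_tid,
                   PR_SCHED_CORE_SCOPE_THREAD, 0);
  }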

From the perspective of the accuracy of the mapping between virtual and
physical topology (and hence, most likely, of performance), it's still
a mixed bag. I.e., on an idle or lightly loaded system, vcpu0 and vcpu1
can still run on two different cores. So, if the guest kernel and apps
assume that the two vcpus are SMT-siblings, and optimize for that,
well, that still might be false/wrong (like it would be without any
core-scheduling, without any pinning, etc). At least, when they run on
different cores, they run there alone, which is nice (but achievable
with per-VM core-scheduling already).

On a heavily loaded system, instead, vcpu0 and vcpu1 should (when they
both want to run) have much higher chances of actually ending up
running on the same core. [Of course, not necessarily always on the same
specific core --like when we do pinning-- but always on the same core as
each other.] So, in-guest workloads operating under the assumption that
those two vcpus are SMT-siblings will hopefully benefit from that.

And for the lightly loaded case, well, I believe that combining
per-vcpu core-scheduling + SMT virtual topology with *some* kind of
vcpu affinity (and I mean something more flexible and less wasteful
than 1-to-1 pinning, of course!) and/or with something like numad, will
actually bring some performance and determinism benefits, even in such
a scenario... But, of course, we need data for that, and I don't have
any yet. :-)

Anyway, let's now consider the case where the user/customer wants to be
able to use core-scheduling _inside_ of the guest, e.g., for protecting
and/or shielding some sensitive workload that they are running inside
of the VM itself, from all the other tasks. But for using core-
scheduling inside of the guest we need the guest to have cores and
threads. And for the protection/shielding to be effective, we need to
be sure that, say, if two guest tasks are in the same trusted group and
are running on two vcpus that are virtual SMT-siblings, these two vcpus
either (1) run on two actual physical SMT-sibling pCPUs on the host
(i.e., they run on the same core), or (2) run on different host
cores, each one on a thread, with no other vCPU from any other VM (and
no host task, for that matter) running on the other thread. And this
is exactly what per-vcpu core-scheduling + SMT virtual topology gives
us. :-D

Of course, as in the previous message, I think that it's perfectly fine
for something like this to not be implemented immediately, and come
later. At least as long as we don't do anything at this stage that would
prevent or make it difficult to implement such extensions in the future.

Which I guess is, after all, the main point of these very long emails
(sorry!) that I am writing. I.e., _if_ we agree that it might be
interesting to have per-VM, or even per-vcpu, core-scheduling in the
future, let's just try to make sure that what we put together now
(especially at the interface level) is easy to extend in that
direction. :-)

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Re: [PATCH RFC 00/10] qemu: Enable SCHED_CORE for domains and helper processes
Posted by Michal Prívozník 1 year, 11 months ago
On 5/26/22 16:00, Dario Faggioli wrote:
> On Thu, 2022-05-26 at 14:01 +0200, Dario Faggioli wrote:
>> Thoughts?
>>
> Oh, and there are even a couple of other (potential) use case, for
> having an (even more!) fine grained control of core-scheduling.
> 
> So, right now, giving a virtual topology to a VM, pretty much only
> makes sense if the VM has its vcpus pinned. Well, actually, there's
> something that we can do even if that is not the case, especially if we
> define at least *some* constraints on where the vcpus can run, even if
> we don't have strict and static 1-to-1 pinning... But for sure we
> shouldn't define an SMT topology, if we don't have that (i.e., if we
> don't have strict and static 1-to-1 pinning). And yet, the vcpus will
> run on cores and threads!
> 
> Now, if we implement per-vcpu core-scheduling (which means being able
> to put not necessarily whole VMs, but single vcpus [although, of the
> same VM], in trusted groups), then we can:
> - put vcpu0 and vcpu1 of VM1 in a group
> - put vcpu2 and vcpu3 of VM1 in a(nother!) group
> - define, in the virtual topology of VM1, vcpu0 and vcpu1 as
>   SMT-threads of the same core
> - define, in the virtual topology of VM1, vcpu2 and vcpu3 as
>   SMT-threads of the same core

These last two we can't really do ourselves. It has to come from the
domain definition. Otherwise we might break guest ABI, because unless
configured in the domain XML, all vCPUs are different cores (e.g.
<vcpu>4</vcpu> gives you 4 separate cores).

What we could do is utilize the CPU topology, regardless of pinning. I
mean, for the following config:

  <vcpu>4</vcpu>
  <cpu>
    <topology sockets='1' dies='1' cores='2' threads='2'/>
  </cpu>

which gives you two cores with two threads each. Now, we could place the
threads of one core into one group, and the two threads of the other core
into another group.
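
A tiny sketch of the mapping I have in mind (assuming vCPU IDs follow the
topology order, which is an assumption on my side, not something the XML
guarantees by itself):

  /* Sketch: with <topology ... cores='2' threads='2'/>, vCPUs that are
   * virtual SMT siblings end up in the same trusted group. */
  static unsigned int vcpu_group(unsigned int vcpu_id,
                                 unsigned int threads_per_core)
  {
      return vcpu_id / threads_per_core;  /* 0,1 -> group 0; 2,3 -> group 1 */
  }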

Ideally, I'd like to avoid computing an intersection with pinning
because that will get hairy pretty quickly (as you demonstrated in this
e-mail). For properly pinned vCPUs this won't incur any performance
penalty (yes, it's still possible to come up with an artificial
counterexample), and for "misconfigured" pinning, well, tough luck.

Michal
Re: [PATCH RFC 00/10] qemu: Enable SCHED_CORE for domains and helper processes
Posted by Michal Prívozník 1 year, 11 months ago
On 5/23/22 18:13, Daniel P. Berrangé wrote:
> On Mon, May 09, 2022 at 05:02:07PM +0200, Michal Privoznik wrote:
>> The Linux kernel offers a way to mitigate side channel attacks on Hyper
>> Threads (e.g. MDS and L1TF). Long story short, userspace can define
>> groups of processes (aka trusted groups) and only processes within one
>> group can run on sibling Hyper Threads. The group membership is
>> automatically preserved on fork() and exec().
>>
>> Now, there is one scenario which I don't cover in my series and I'd like
>> to hear proposals: if there are two guests with an odd number of vCPUs,
>> they can no longer run on sibling Hyper Threads because my patches create
>> a separate group for each QEMU. This is a performance penalty. Ideally,
>> we would have a knob inside the domain XML that would place two or more
>> domains into the same trusted group. But since there's no pre-existing
>> example (of sharing a piece of information between two domains), I've
>> failed to come up with something usable.
> 
> Right now users have two choices
> 
>   - Run with SMT enabled. 100% of CPUs available. VMs are vulnerable
>   - Run with SMT disabled. 50% of CPUs available. VMs are safe
> 
> What core scheduling gives is somewhere in between, depending on
> the vCPU count. If we assume all guests have an even number of vCPUs then
> 
>   - Run with SMT enabled + core scheduling. 100% of CPUs available.
>     100% of CPUs are used, VMs are safe
> 
> This is the ideal scenario, and probably the fairly common scenario
> too as IMHO even number CPU counts are likely to be typical.
> 
> If we assume the worst case, of entirely 1 vCPU guests then we have
> 
>   - Run with SMT enabled + core scheduling. 100% of CPUs available.
>     50% of CPUs are used, VMs are safe
> 
> This feels highly unlikely though, as all except tiny workloads
> want > 1 vCPU.
> 
> With entirely 3 vCPU guests then we have
> 
>   - Run with SMT enabled + core scheduling. 100% of CPUs available.
>     75% of CPUs are used, VMs are safe
> 
> With entirely 5 vCPU guests then we have
> 
>   - Run with SMT enabled + core scheduling. 100% of CPUs available.
>     83% of CPUs are used, VMs are safe
> 
> If we have a mix of even and odd numbered vCPU guests, with mostly
> even numbered, then I think utilization will  be high enough that
> almost no one will care about the last few %.
> 
> While we could try to come up with a way to express sharing of
> cores between VMs, I don't think it's worth it in the absence of
> someone presenting compelling data on why it'll be needed in a
> non-niche use case. Bear in mind that users can also resort to
> pinning VMs explicitly to get sharing.
> 
> In terms of defaults I'd very much like us to default to enabling
> core scheduling, so that we have a secure deployment out of the box.
> The only caveat is that this does have the potential to be interpreted
> as a regression for existing deployments in some cases. Perhaps we
> should make it a meson option for distros to decide whether to ship
> with it turned on out of the box or not ?

Alternatively, distros can just patch qemu_conf.c to enable the
option in the config (virQEMUDriverConfigNew()).

Michal

Re: [PATCH RFC 00/10] qemu: Enable SCHED_CORE for domains and helper processes
Posted by Michal Prívozník 1 year, 11 months ago
On 5/9/22 17:02, Michal Privoznik wrote:
>

Polite ping.

Michal
Re: [PATCH RFC 00/10] qemu: Enable SCHED_CORE for domains and helper processes
Posted by Michal Prívozník 1 year, 11 months ago
On 5/18/22 14:48, Michal Prívozník wrote:
> On 5/9/22 17:02, Michal Privoznik wrote:
>>
> 
> Polite ping.

Less polite ping.

Michal