The Linux kernel offers a way to mitigate side channel attacks on Hyper Threads (e.g. MDS and L1TF). Long story short, userspace can define groups of processes (aka trusted groups) and only processes within one group can run on sibling Hyper Threads. The group membership is automatically preserved on fork() and exec().

Now, there is one scenario which I don't cover in my series and I'd like to hear proposals: if there are two guests with an odd number of vCPUs they can no longer run on sibling Hyper Threads, because my patches create a separate group for each QEMU. This is a performance penalty. Ideally, we would have a knob inside the domain XML that would place two or more domains into the same trusted group. But since there's no pre-existing example (of sharing a piece of information between two domains) I've failed to come up with something usable.

Also, it's worth noting that on the kernel level, group membership is expressed by a so called 'cookie', which is effectively a unique unsigned long number, but there's no API that would "set this number on a given process", so we may have to go with some abstraction layer.
Michal Prívozník (10):
  qemu_tpm: Make APIs work over a single virDomainTPMDef
  qemu_dbus: Separate PID read code into qemuDBusGetPID
  qemu_vhost_user_gpu: Export qemuVhostUserGPUGetPid()
  qemu_tpm: Expose qemuTPMEmulatorGetPid()
  qemu_virtiofs: Separate PID read code into qemuVirtioFSGetPid
  virprocess: Core Scheduling support
  virCommand: Introduce APIs for core scheduling
  qemu_conf: Introduce a knob to turn off SCHED_CORE
  qemu: Enable SCHED_CORE for domains and helper processes
  qemu: Place helper processes into the same trusted group

 src/libvirt_private.syms           |   6 +
 src/qemu/libvirtd_qemu.aug         |   1 +
 src/qemu/qemu.conf.in              |   5 +
 src/qemu/qemu_conf.c               |  24 ++++
 src/qemu/qemu_conf.h               |   2 +
 src/qemu/qemu_dbus.c               |  42 ++++---
 src/qemu/qemu_dbus.h               |   4 +
 src/qemu/qemu_extdevice.c          | 171 ++++++++++++++++++++++++++---
 src/qemu/qemu_extdevice.h          |   3 +
 src/qemu/qemu_process.c            |   9 ++
 src/qemu/qemu_security.c           |   4 +
 src/qemu/qemu_tpm.c                |  91 +++++----------
 src/qemu/qemu_tpm.h                |  18 ++-
 src/qemu/qemu_vhost_user_gpu.c     |   2 +-
 src/qemu/qemu_vhost_user_gpu.h     |   8 ++
 src/qemu/qemu_virtiofs.c           |  41 ++++---
 src/qemu/qemu_virtiofs.h           |   5 +
 src/qemu/test_libvirtd_qemu.aug.in |   1 +
 src/util/vircommand.c              |  74 +++++++++++++
 src/util/vircommand.h              |   5 +
 src/util/virprocess.c              | 124 +++++++++++++++++++++
 src/util/virprocess.h              |   8 ++
 22 files changed, 538 insertions(+), 110 deletions(-)

-- 
2.35.1
On Mon, May 09, 2022 at 05:02:07PM +0200, Michal Privoznik wrote:
> The Linux kernel offers a way to mitigate side channel attacks on Hyper
> Threads (e.g. MDS and L1TF). Long story short, userspace can define
> groups of processes (aka trusted groups) and only processes within one
> group can run on sibling Hyper Threads. The group membership is
> automatically preserved on fork() and exec().
>
> Now, there is one scenario which I don't cover in my series and I'd like
> to hear proposals: if there are two guests with an odd number of vCPUs
> they can no longer run on sibling Hyper Threads because my patches
> create a separate group for each QEMU. This is a performance penalty.
> Ideally, we would have a knob inside the domain XML that would place two
> or more domains into the same trusted group. But since there's no
> pre-existing example (of sharing a piece of information between two
> domains) I've failed to come up with something usable.

Right now users have two choices:

 - Run with SMT enabled. 100% of CPUs available. VMs are vulnerable.
 - Run with SMT disabled. 50% of CPUs available. VMs are safe.

What core scheduling gives is somewhere in between, depending on the
vCPU count. If we assume all guests have an even number of vCPUs, then:

 - Run with SMT enabled + core scheduling. 100% of CPUs available.
   100% of CPUs are used, VMs are safe.

This is the ideal scenario, and probably the fairly common one too, as
IMHO even vCPU counts are likely to be typical.

If we assume the worst case, of entirely 1 vCPU guests, then we have:

 - Run with SMT enabled + core scheduling. 100% of CPUs available.
   50% of CPUs are used, VMs are safe.

This feels highly unlikely though, as all except tiny workloads want
more than 1 vCPU.

With entirely 3 vCPU guests we have:

 - Run with SMT enabled + core scheduling. 100% of CPUs available.
   75% of CPUs are used, VMs are safe.

With entirely 5 vCPU guests we have:

 - Run with SMT enabled + core scheduling. 100% of CPUs available.
   83% of CPUs are used, VMs are safe.

If we have a mix of even and odd numbered vCPU guests, with mostly even
numbered, then I think utilization will be high enough that almost no
one will care about the last few %.

While we could try to come up with a way to express sharing of cores
between VMs, I don't think it's worth it, in the absence of someone
presenting compelling data on why it'll be needed in a non-niche use
case. Bear in mind that users can also resort to pinning VMs explicitly
to get sharing.

In terms of defaults I'd very much like us to default to enabling core
scheduling, so that we have a secure deployment out of the box. The only
caveat is that this does have the potential to be interpreted as a
regression for existing deployments in some cases. Perhaps we should
make it a meson option for distros to decide whether to ship with it
turned on out of the box or not?

I don't think we need core scheduling to be a VM XML config option,
because security is really a host level matter IMHO, such that it
doesn't make sense to have both secure & insecure VMs co-located.

With regards,
Daniel
-- 
|: https://berrange.com      -o-      https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
On Mon, 2022-05-23 at 17:13 +0100, Daniel P. Berrangé wrote:
> On Mon, May 09, 2022 at 05:02:07PM +0200, Michal Privoznik wrote:
> In terms of defaults I'd very much like us to default to enabling
> core scheduling, so that we have a secure deployment out of the box.
> The only caveat is that this does have the potential to be interpreted
> as a regression for existing deployments in some cases. Perhaps we
> should make it a meson option for distros to decide whether to ship
> with it turned on out of the box or not?
>
I think, as Michal said, with the qemu.conf knob from patch 8, we will
already have that. I.e., distros will ship a qemu.conf with sched_core
equal to 1 or 0, depending on what they want as a default behavior.

> I don't think we need core scheduling to be a VM XML config option,
> because security is really a host level matter IMHO, such that it
> doesn't make sense to have both secure & insecure VMs co-located.
>
Mmm... I can't say that I have any concrete example, but I guess I can
picture a situation where someone has "sensitive" VMs, which he/she
would want to make sure are protected from the possibility that other
VMs steal their secrets, and "less sensitive" ones, for which it's not a
concern if they share cores and (potentially) steal secrets from each
other (as long as none of those can steal from any "sensitive" one; but
this does not happen, if we set core scheduling for the latter).

Another scenario would be if core-scheduling is (ab)used for limiting
interference. Like some sort of flexible and dynamic form of
vcpu-pinning. That is, if I set core-scheduling for VM1, I'm sure that
VM1's vcpus will never share cores with any other VMs. Which is good for
performance and determinism, because it means that it can't happen that
vcpu1 of VM3 runs on the same core as vcpu0 of VM1 and, when VM3-vcpu1
is busy, VM1-vcpu0 slows down as well.

Imagine that VM1 and VM3 are owned by different customers:
core-scheduling would allow me to make sure that whatever customer A is
doing in VM3, it can't slow down customer B, who owns VM1, without
having to resort to vcpu-pinning, which is inflexible. And again, maybe
we do want this "dynamic interference shielding" property for some VMs,
but not for all... E.g., we can offer it as a higher SLA, and ask more
money for a VM that has it.

Thoughts?

In any case, even if we decide that we do want per-VM core-scheduling,
e.g., for the above mentioned reasons, I guess it can come later, as a
further improvement (and I'd be happy to help making it happen).

Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
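[Editor's note: for concreteness, the knob discussed above would presumably be toggled like any other qemu.conf setting. A hypothetical fragment, with the option name and values as described in this thread rather than verified against the merged code:]

```ini
# /etc/libvirt/qemu.conf -- hypothetical fragment (patch 8's knob as
# described in this thread):
#
# Place each QEMU process (and its helper processes) into its own
# core-scheduling trusted group. 1 = enabled, 0 = disabled.
sched_core = 1
```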
On 5/26/22 14:01, Dario Faggioli wrote:
> On Mon, 2022-05-23 at 17:13 +0100, Daniel P. Berrangé wrote:
>> On Mon, May 09, 2022 at 05:02:07PM +0200, Michal Privoznik wrote:
>> In terms of defaults I'd very much like us to default to enabling
>> core scheduling, so that we have a secure deployment out of the box.
>> The only caveat is that this does have the potential to be interpreted
>> as a regression for existing deployments in some cases. Perhaps we
>> should make it a meson option for distros to decide whether to ship
>> with it turned on out of the box or not?
>>
> I think, as Michal said, with the qemu.conf knob from patch 8, we will
> already have that. I.e., distros will ship a qemu.conf with sched_core
> equal to 1 or 0, depending on what they want as a default behavior.
>
>> I don't think we need core scheduling to be a VM XML config option,
>> because security is really a host level matter IMHO, such that it
>> doesn't make sense to have both secure & insecure VMs co-located.
>>
> Mmm... I can't say that I have any concrete example, but I guess I can
> picture a situation where someone has "sensitive" VMs, which he/she
> would want to make sure are protected from the possibility that other
> VMs steal their secrets, and "less sensitive" ones, for which it's not
> a concern if they share cores and (potentially) steal secrets from
> each other (as long as none of those can steal from any "sensitive"
> one; but this does not happen, if we set core scheduling for the
> latter).
>
> Another scenario would be if core-scheduling is (ab)used for limiting
> interference. Like some sort of flexible and dynamic form of
> vcpu-pinning. That is, if I set core-scheduling for VM1, I'm sure that
> VM1's vcpus will never share cores with any other VMs. Which is good
> for performance and determinism, because it means that it can't happen
> that vcpu1 of VM3 runs on the same core as vcpu0 of VM1 and, when
> VM3-vcpu1 is busy, VM1-vcpu0 slows down as well. Imagine that VM1 and
> VM3 are owned by different customers: core-scheduling would allow me
> to make sure that whatever customer A is doing in VM3, it can't slow
> down customer B, who owns VM1, without having to resort to
> vcpu-pinning, which is inflexible. And again, maybe we do want this
> "dynamic interference shielding" property for some VMs, but not for
> all... E.g., we can offer it as a higher SLA, and ask more money for a
> VM that has it.
>
> Thoughts?

I'd expect the host scheduler to work around this problem, e.g. it could
run vCPUs of different VMs on different cores. Of course, this assumes
that they are allowed to run on different cores (i.e. they are not
pinned onto the same physical CPU). And if they are then that's
obviously a misconfig on the admin's side.

Michal
On Thu, 2022-05-26 at 14:01 +0200, Dario Faggioli wrote:
> Thoughts?
>
Oh, and there are even a couple of other (potential) use cases for
having an (even more!) fine grained control of core-scheduling.

So, right now, giving a virtual topology to a VM pretty much only makes
sense if the VM has its vcpus pinned. Well, actually, there's something
that we can do even if that is not the case, especially if we define at
least *some* constraints on where the vcpus can run, even if we don't
have strict and static 1-to-1 pinning... But for sure we shouldn't
define an SMT topology, if we don't have that (i.e., if we don't have
strict and static 1-to-1 pinning). And yet, the vcpus will run on cores
and threads!

Now, if we implement per-vcpu core-scheduling (which means being able
to put not necessarily whole VMs, but single vcpus [although, of the
same VM], in trusted groups), then we can:
 - put vcpu0 and vcpu1 of VM1 in a group
 - put vcpu2 and vcpu3 of VM1 in a(nother!) group
 - define, in the virtual topology of VM1, vcpu0 and vcpu1 as
   SMT-threads of the same core
 - define, in the virtual topology of VM1, vcpu2 and vcpu3 as
   SMT-threads of the same core

From the perspective of the accuracy of the mapping between virtual and
physical topology (and hence, most likely, of performance), it's still
a mixed bag. I.e., on an idle or lightly loaded system, vcpu0 and vcpu1
can still run on two different cores. So, if the guest kernel and apps
assume that the two vcpus are SMT-siblings, and optimize for that,
well, that still might be false/wrong (like it would be without any
core-scheduling, without any pinning, etc). At least, when they run on
different cores, they run there alone, which is nice (but achievable
with per-VM core-scheduling already). On a heavily loaded system,
instead, vcpu0 and vcpu1 should (when they both want to run) have much
higher chances of actually ending up running on the same core.
[Of course, not necessarily always on one specific core --like when we
do pinning-- but always together on the same core.] So, in-guest
workloads operating under the assumption that those two vcpus are
SMT-siblings will hopefully benefit from that.

And for the lightly loaded case, well, I believe that combining
per-vcpu core-scheduling + SMT virtual topology with *some* kind of
vcpu affinity (and I mean something more flexible and less wasteful
than 1-to-1 pinning, of course!) and/or with something like numad, will
actually bring some performance and determinism benefits, even in such
a scenario... But, of course, we need data for that, and I don't have
any yet. :-)

Anyway, let's now consider the case where the user/customer wants to be
able to use core-scheduling _inside_ of the guest, e.g., for protecting
and/or shielding some sensitive workload that he/she is running inside
of the VM itself, from all the other tasks. But for using
core-scheduling inside of the guest we need the guest to have cores and
threads. And for the protection/shielding to be effective, we need to
be sure that, say, if two guest tasks are in the same trusted group and
are running on two vcpus that are virtual SMT-siblings, these two vcpus
either (1) run on two actual physical SMT-sibling pCPUs on the host
(i.e., they run on the same core), or (2) run on different host cores,
each one on a thread, with no other vCPU from any other VM (and no host
task, for what matters) running on the other thread. And this is
exactly what per-vcpu core-scheduling + SMT virtual topology gives
us. :-D

Of course, as in the previous message, I think that it's perfectly fine
for something like this to not be implemented immediately, and come
later. At least as far as we don't do anything at this stage that will
prevent/make it difficult to implement such extensions in the future.
Which I guess is, after all, the main point of these very long emails
(sorry!) that I am writing.
I.e., _if_ we agree that it might be interesting to have per-VM, or
even per-vcpu, core-scheduling in the future, let's just try to make
sure that what we put together now (especially at the interface level)
is easy to extend in that direction. :-)

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
On 5/26/22 16:00, Dario Faggioli wrote:
> On Thu, 2022-05-26 at 14:01 +0200, Dario Faggioli wrote:
>> Thoughts?
>>
> Oh, and there are even a couple of other (potential) use cases for
> having an (even more!) fine grained control of core-scheduling.
>
> So, right now, giving a virtual topology to a VM pretty much only
> makes sense if the VM has its vcpus pinned. Well, actually, there's
> something that we can do even if that is not the case, especially if
> we define at least *some* constraints on where the vcpus can run,
> even if we don't have strict and static 1-to-1 pinning... But for
> sure we shouldn't define an SMT topology, if we don't have that
> (i.e., if we don't have strict and static 1-to-1 pinning). And yet,
> the vcpus will run on cores and threads!
>
> Now, if we implement per-vcpu core-scheduling (which means being able
> to put not necessarily whole VMs, but single vcpus [although, of the
> same VM], in trusted groups), then we can:
>  - put vcpu0 and vcpu1 of VM1 in a group
>  - put vcpu2 and vcpu3 of VM1 in a(nother!) group
>  - define, in the virtual topology of VM1, vcpu0 and vcpu1 as
>    SMT-threads of the same core
>  - define, in the virtual topology of VM1, vcpu2 and vcpu3 as
>    SMT-threads of the same core

These last two we can't do ourselves, really. It has to come from the
domain definition. Otherwise we might break guest ABI, because unless
configured in the domain XML all vCPUs are different cores (e.g.
<vcpu>4</vcpu> gives you a 4-core vCPU topology).

What we could do is to utilize the CPU topology, regardless of pinning.
I mean, for the following config:

  <vcpu>4</vcpu>
  <cpu>
    <topology sockets='1' dies='1' cores='2' threads='2'/>
  </cpu>

which gives you two threads in each of two cores, we could place the
threads of one core into one group, and the other two threads of the
other core into another group. Ideally, I'd like to avoid computing an
intersection with pinning because that will get hairy pretty quickly
(as you demonstrated in this e-mail).

For properly pinned vCPUs this won't incur any performance penalty
(yes, it's still possible to come up with an artificial counter
example), and for "misconfigured" pinning, well, tough luck.

Michal
On 5/23/22 18:13, Daniel P. Berrangé wrote:
> On Mon, May 09, 2022 at 05:02:07PM +0200, Michal Privoznik wrote:
>> The Linux kernel offers a way to mitigate side channel attacks on
>> Hyper Threads (e.g. MDS and L1TF). Long story short, userspace can
>> define groups of processes (aka trusted groups) and only processes
>> within one group can run on sibling Hyper Threads. The group
>> membership is automatically preserved on fork() and exec().
>>
>> Now, there is one scenario which I don't cover in my series and I'd
>> like to hear proposals: if there are two guests with an odd number of
>> vCPUs they can no longer run on sibling Hyper Threads because my
>> patches create a separate group for each QEMU. This is a performance
>> penalty. Ideally, we would have a knob inside the domain XML that
>> would place two or more domains into the same trusted group. But
>> since there's no pre-existing example (of sharing a piece of
>> information between two domains) I've failed to come up with
>> something usable.
>
> Right now users have two choices:
>
>  - Run with SMT enabled. 100% of CPUs available. VMs are vulnerable.
>  - Run with SMT disabled. 50% of CPUs available. VMs are safe.
>
> What core scheduling gives is somewhere in between, depending on the
> vCPU count. If we assume all guests have an even number of vCPUs,
> then:
>
>  - Run with SMT enabled + core scheduling. 100% of CPUs available.
>    100% of CPUs are used, VMs are safe.
>
> This is the ideal scenario, and probably the fairly common one too,
> as IMHO even vCPU counts are likely to be typical.
>
> If we assume the worst case, of entirely 1 vCPU guests, then we have:
>
>  - Run with SMT enabled + core scheduling. 100% of CPUs available.
>    50% of CPUs are used, VMs are safe.
>
> This feels highly unlikely though, as all except tiny workloads want
> more than 1 vCPU.
>
> With entirely 3 vCPU guests we have:
>
>  - Run with SMT enabled + core scheduling. 100% of CPUs available.
>    75% of CPUs are used, VMs are safe.
>
> With entirely 5 vCPU guests we have:
>
>  - Run with SMT enabled + core scheduling. 100% of CPUs available.
>    83% of CPUs are used, VMs are safe.
>
> If we have a mix of even and odd numbered vCPU guests, with mostly
> even numbered, then I think utilization will be high enough that
> almost no one will care about the last few %.
>
> While we could try to come up with a way to express sharing of cores
> between VMs, I don't think it's worth it, in the absence of someone
> presenting compelling data on why it'll be needed in a non-niche use
> case. Bear in mind that users can also resort to pinning VMs
> explicitly to get sharing.
>
> In terms of defaults I'd very much like us to default to enabling
> core scheduling, so that we have a secure deployment out of the box.
> The only caveat is that this does have the potential to be interpreted
> as a regression for existing deployments in some cases. Perhaps we
> should make it a meson option for distros to decide whether to ship
> with it turned on out of the box or not?

Alternatively, distros can just patch qemu_conf.c, which enables the
option in the cfg (virQEMUDriverConfigNew()).

Michal
On 5/9/22 17:02, Michal Privoznik wrote:
>

Polite ping.

Michal