Hello there!

Several weeks ago, a group of Red Hatters working on the virtualization
stack (primarily QEMU and libvirt) started a conversation with
developers from the KubeVirt project with the goal of better
understanding and documenting the interactions between the two.

Specifically, we were interested in integration pain points, with the
underlying idea being that only once those issues are understood does it
become possible to look for solutions, and that better communication
would naturally lead to improvements on both sides.

This series of documents was born out of that conversation. We're
sharing them with the QEMU and libvirt communities in the hope that they
can be a valuable resource for understanding how the projects they're
working on are consumed by higher-level tools, and what challenges are
encountered in the process.

Note that, while the documents describe a number of potential directions
for things like development of new components, that's all just
brainstorming that naturally occurred as we were learning new things:
the actual design process should, and will, happen on the upstream
lists.

Right now the documents live in their own little git repository[1], but
the expectation is that eventually they will find a suitable long-term
home. The most likely candidate right now is the main KubeVirt
repository, but if you have other locations in mind please do speak up!

I'm also aware of the fact that this delivery mechanism is fairly
unconventional, but I thought it would be the best way to spark a
discussion around these topics with the QEMU and libvirt developers.

Last but not least, please keep in mind that the documents are a work in
progress, and polish has been applied to them unevenly: while the
information presented is, to the best of our knowledge, all accurate,
some parts are in a rougher state than others. Improvements will
hopefully come over time - and if you feel like helping out in making
that happen, it would certainly be appreciated!

Looking forward to your feedback :)


[1] https://gitlab.com/abologna/kubevirt-and-kvm

--
Andrea Bolognani / Red Hat / Virtualization
Hi Andrea,

On 9/16/20 6:44 PM, Andrea Bolognani wrote:
> Hello there!
>
> Several weeks ago, a group of Red Hatters working on the
> virtualization stack (primarily QEMU and libvirt) started a
> conversation with developers from the KubeVirt project with the goal
> of better understanding and documenting the interactions between the
> two.
>
> Specifically, we were interested in integration pain points, with the
> underlying idea being that only once those issues are understood does
> it become possible to look for solutions, and that better
> communication would naturally lead to improvements on both sides.
>
> This series of documents was born out of that conversation. We're
> sharing them with the QEMU and libvirt communities in the hope that
> they can be a valuable resource for understanding how the projects
> they're working on are consumed by higher-level tools, and what
> challenges are encountered in the process.
>
> Note that, while the documents describe a number of potential
> directions for things like development of new components, that's all
> just brainstorming that naturally occurred as we were learning new
> things: the actual design process should, and will, happen on the
> upstream lists.
>
> Right now the documents live in their own little git repository[1],
> but the expectation is that eventually they will find a suitable
> long-term home. The most likely candidate right now is the main
> KubeVirt repository, but if you have other locations in mind please
> do speak up!
>
> I'm also aware of the fact that this delivery mechanism is fairly
> unconventional, but I thought it would be the best way to spark a
> discussion around these topics with the QEMU and libvirt developers.
>
> Last but not least, please keep in mind that the documents are a work
> in progress, and polish has been applied to them unevenly: while the
> information presented is, to the best of our knowledge, all accurate,
> some parts are in a rougher state than others. Improvements will
> hopefully come over time - and if you feel like helping out in making
> that happen, it would certainly be appreciated!
>
> Looking forward to your feedback :)
>
>
> [1] https://gitlab.com/abologna/kubevirt-and-kvm

Thanks a lot for this documentation: I learned new things and use cases
outside my area of interest. It's useful, as a developer, to better
understand how the areas I'm working on are actually used; this narrows
the gap between developers and users a bit.

Even more valuable than developer review/feedback would be feedback from
users and technical writers. Suggestion: also share it on
qemu-discuss@nongnu.org, which is less technical (maybe simply repost
the cover letter and a link to the Wiki).

--

What is not obvious from this cover letter (and the documents pasted on
the list) is that there are schema pictures on the Wiki pages, which
cannot be viewed and appreciated via an email post.

--

I had zero knowledge of Kubernetes, and I was confused by the
introduction...

From the Index:

"The intended audience is people who are familiar with the traditional
virtualization stack (QEMU plus libvirt), and in order to make it more
approachable to them, comparisons are included and little to no
knowledge of KubeVirt or Kubernetes is assumed."

Then in Architecture's {Goals and Components} there is an assumption
that Kubernetes is known. Once we reach Components, Kubernetes is
briefly but sufficiently explained.

Then KubeVirt is very well explained.

--

Sometimes the "Other topics" category is confusing: it seems out of
scope for "better understanding and documenting the interactions
between KubeVirt and KVM" and reads like leftover notes. I.e.:

"Another possibility is to leverage the device-mapper from Linux to
provide features such as snapshots and even Incremental Backup. For
example, dm-era seems to provide the basic primitives for bitmap
tracking. This could be part of scenario number 1, or cascaded with
other PVs somewhere else. Is this already being used? For example,
cybozu-go/topolvm is a CSI LVM Plugin for k8s."

"vhost-user-blk in other CSI backends
Would it make sense for other CSI backends to implement support for
vhost-user-blk?"

"The audience is people who are familiar with the traditional
virtualization stack (QEMU plus libvirt)."

Feeling part of the audience, I have no clue how to answer these
questions... I'd prefer you tell me :)

Maybe renaming the "Other topics" section would help: "Unanswered
questions", "Other possibilities to investigate", ...

--

Very good contribution in documentation, thanks!

Phil.
On Tue, 2020-09-22 at 11:29 +0200, Philippe Mathieu-Daudé wrote:
> Hi Andrea,

Hi Philippe, and sorry for the delay in answering!

First of all, thanks for taking the time to go through the documents
and posting your thoughts. More comments below.

> Thanks a lot for this documentation: I learned new things and use
> cases outside my area of interest. It's useful, as a developer, to
> better understand how the areas I'm working on are actually used;
> this narrows the gap between developers and users a bit.
>
> Even more valuable than developer review/feedback would be feedback
> from users and technical writers. Suggestion: also share it on
> qemu-discuss@nongnu.org, which is less technical (maybe simply repost
> the cover letter and a link to the Wiki).

More eyes would obviously be good, but note that these documents are
really intended to improve the interactions between QEMU/libvirt and
KubeVirt, so the audience is ultimately developers. Of course you could
say that KubeVirt developers *are* users when it comes to QEMU/libvirt,
and you wouldn't be wrong ;) Still, qemu-devel seems like the proper
venue.

> What is not obvious from this cover letter (and the documents pasted
> on the list) is that there are schema pictures on the Wiki pages,
> which cannot be viewed and appreciated via an email post.

You're right! I was pretty sure I had a line about that somewhere in
there, but I guess it got lost during editing. Hopefully the URL at the
very beginning of each document caused people to browse the HTML
version.

> I had zero knowledge of Kubernetes, and I was confused by the
> introduction...
>
> From the Index:
>
> "The intended audience is people who are familiar with the
> traditional virtualization stack (QEMU plus libvirt), and in order to
> make it more approachable to them, comparisons are included and
> little to no knowledge of KubeVirt or Kubernetes is assumed."
>
> Then in Architecture's {Goals and Components} there is an assumption
> that Kubernetes is known. Once we reach Components, Kubernetes is
> briefly but sufficiently explained.
>
> Then KubeVirt is very well explained.

I guess the sections in the Index you're referring to assume that you
know that Kubernetes is somehow connected to containers, and that it's
a clustered environment. Anything else I missed?

Perhaps we could move the contents of

  https://gitlab.cee.redhat.com/abologna/kubevirt-and-kvm/-/blob/master/Components.md#kubernetes

to a small document that's linked to near the very top. Would that
improve things, in your opinion?

> Sometimes the "Other topics" category is confusing: it seems out of
> scope for "better understanding and documenting the interactions
> between KubeVirt and KVM" and reads like leftover notes.

That's probably because they absolutely are O:-)

> Maybe renaming the "Other topics" section would help: "Unanswered
> questions", "Other possibilities to investigate", ...

This sounds sensible :)

Thanks again for your feedback!

--
Andrea Bolognani / Red Hat / Virtualization
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Index.md

# KubeVirt and the KVM user space

This is the entry point to a series of documents which, together,
detail the current status of KubeVirt and how it interacts with the KVM
user space.

The intended audience is people who are familiar with the traditional
virtualization stack (QEMU plus libvirt); in order to make the
documents more approachable to them, comparisons are included and
little to no knowledge of KubeVirt or Kubernetes is assumed.

Each section contains a short summary as well as a link to a separate
document discussing the topic in more detail, with the intention that
readers will be able to quickly get a high-level understanding of the
various topics by reading this overview document and then dig further
into the specific areas they're interested in.

## Architecture

### Goals

* KubeVirt aims to feel completely native to Kubernetes users
* VMs should behave like containers whenever possible
* There should be no features that are limited to VMs when it would
  make sense to implement them for containers too
* KubeVirt also aims to support all the workloads that traditional
  virtualization can handle
* Windows support, device assignment etc. are all fair game
* When these two goals clash, integration with Kubernetes usually wins

### Components

* KubeVirt is made up of various discrete components that interact with
  Kubernetes and the KVM user space
* The overall design is somewhat similar to that of libvirt, except
  with a much higher granularity and many of the tasks offloaded to
  Kubernetes
* Some of the components run at the cluster level or host level with
  very high privileges, others run at the pod level with significantly
  reduced privileges

Additional information: [Components][]

### Runtime environment

* QEMU expects its environment to be set up in advance, something that
  is typically taken care of by libvirt
* libvirtd, when not running in session mode, assumes that it has
  root-level access to the system and can perform pretty much any
  privileged operation
* In Kubernetes, the runtime environment is usually heavily locked down
  and many privileged operations are not permitted
* Requiring additional permissions for VMs goes against the goal,
  mentioned earlier, of having VMs behave the same as containers
  whenever possible

## Specific areas

### Hotplug

* QEMU supports hotplug (and hot-unplug) of most devices, and its use
  is extremely common
* Conversely, resources associated with containers such as storage
  volumes, network interfaces and CPU shares are allocated upfront and
  do not change throughout the life of the workload
* If the container needs more (or fewer) resources, the Kubernetes
  approach is to destroy the existing one and schedule a new one to
  take over its role

Additional information: [Hotplug][]

### Storage

* Handled through the same Kubernetes APIs used for containers
* QEMU / libvirt only see an image file and don't have direct access to
  the underlying storage implementation
* This makes certain scenarios that are common in the virtualization
  world very challenging: examples include hotplug and full VM
  snapshots (storage plus memory)
* It might be possible to remove some of these limitations by changing
  the way storage is exposed to QEMU, or even take advantage of the
  storage technologies that QEMU already implements and make them
  available to containers in addition to VMs

Additional information: [Storage][]

### Networking

* Application processes running in VMs are hidden behind a network
  interface, as opposed to local sockets and processes running in a
  separate user namespace
* Service meshes proxy and monitor applications by means of socket
  redirection and classification on local ports and process
  identifiers. We need to aim for generic compatibility
* Existing solutions replicate a full TCP/IP stack to pretend that
  applications running in a QEMU instance are local. No chances for
  zero-copy and context switching avoidance
* Network connectivity is shared between the control plane and the
  workload itself. Addressing and port mapping need particular
  attention
* Linux capabilities granted to the pod might be minimal, or none at
  all. Live migration presents further challenges in terms of network
  addressing and port mapping

Additional information: [Networking][]

### Live migration

* QEMU supports live migration between hosts, usually coordinated by
  libvirt
* Kubernetes expects containers to be disposable, so the equivalent of
  live migration would be to simply destroy the ones running on the
  source node and schedule replacements on the destination node
* For KubeVirt, a hybrid approach is used: a new container is created
  on the target node, then the VM is migrated from the old container,
  running on the source node, to the newly-created one

Additional information: [Live migration][]

### CPU pinning

* CPU pinning is not handled by QEMU directly, but is instead delegated
  to libvirt
* KubeVirt figures out which CPUs are assigned to the container after
  it has been started by Kubernetes, and passes that information to
  libvirt so that it can perform CPU pinning

Additional information: [CPU pinning][]

### NUMA pinning

* NUMA pinning is not handled by QEMU directly, but is instead
  delegated to libvirt
* KubeVirt doesn't implement NUMA pinning at the moment

Additional information: [NUMA pinning][]

### Isolation

* For security reasons, it's a good idea to run each QEMU process in an
  environment that is isolated from the host as well as other VMs
* This includes using a separate unprivileged user account, setting up
  namespaces and cgroups, using SELinux...
* QEMU doesn't take care of this itself and delegates it to libvirt
* Most of these techniques serve as the base for containers, so
  KubeVirt can rely on Kubernetes providing a similar level of
  isolation without further intervention

Additional information: [Isolation][]

## Other tidbits

### Upgrades

* When libvirt is upgraded, running VMs keep using the old QEMU binary:
  the new QEMU binary is used for newly-started VMs as well as after
  VMs have been power cycled or migrated
* KubeVirt behaves similarly, with the old version of libvirt and QEMU
  remaining in use for running VMs

Additional information: [Upgrades][]

### Backpropagation

* Applications using libvirt usually don't provide all information, eg.
  a full PCI topology, and let libvirt fill in the blanks
* This might require a second step where the additional information is
  collected and stored along with the original one
* Backpropagation doesn't fit well in Kubernetes' declarative model, so
  KubeVirt doesn't currently perform it

Additional information: [Backpropagation][]

## Contacts and credits

This information was collected and organized by many people at Red Hat,
some of whom have agreed to serve as points of contact for follow-up
discussion.

Additional information: [Contacts][]

[Backpropagation]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Backpropagation.md
[CPU pinning]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/CPU-Pinning.md
[Components]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Components.md
[Contacts]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Contacts.md
[Hotplug]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Hotplug.md
[Isolation]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Isolation.md
[Live migration]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Live-Migration.md
[NUMA pinning]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/NUMA-Pinning.md
[Networking]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
[Storage]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Storage.md
[Upgrades]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Upgrades.md
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Components.md

# Components

This document describes the various components of the KubeVirt
architecture, how they fit together, and how they compare to the
traditional virtualization architecture (QEMU + libvirt).

## Traditional architecture

For the comparison to make sense, let's start by reviewing the
architecture used for traditional virtualization.

![libvirt architecture][Components-Libvirt]

(Image taken from the "[Look into libvirt][]" presentation by Osier
Yang, which is a bit old but still mostly accurate from a high-level
perspective.)

In particular, the `libvirtd` process runs with high privileges on the
host and is responsible for managing all VMs. When asked to start a VM,
the management process will

* Prepare the environment by performing a number of privileged
  operations upfront
  * Set up CGroups
  * Set up kernel namespaces
  * Apply SELinux labels
  * Configure network devices
  * Open host files
  * ...
* Start a non-privileged QEMU process in that environment

## Kubernetes

To understand how KubeVirt works, it's first necessary to have some
knowledge of Kubernetes.

In Kubernetes, every user workload runs inside [Pods][]. The pod is the
smallest unit of work that Kubernetes will schedule. Some facts about
pods:

* They consist of multiple containers
* The containers share a network namespace
* The containers have their own PID and mount namespaces
* The containers have their own CGroups for CPU, memory, devices and so
  forth. They are controlled by k8s and can't be modified from outside.
* Pods can be started with extended privileges (`CAP_NICE`,
  `CAP_NET_RAW`, root user, ...)
* The app in the pod can drop its privileges, but the pod can not drop
  them (`kubectl exec` gives you a shell with the full privileges).

Creating pods with elevated privileges is generally frowned upon, and
depending on the policy decided by the cluster administrator it might
be outright impossible.
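The pod facts above can be illustrated with a minimal manifest, written
here as a Python dict purely for illustration (all names and images are
made up):

```python
# Minimal sketch of a pod with two containers. All containers in a pod
# share one network namespace, but each gets its own PID and mount
# namespaces; privileges are granted at creation time and cannot be
# dropped for the pod as a whole afterwards.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "example-vm-pod"},
    "spec": {
        "containers": [
            {
                "name": "compute",
                "image": "example/virt-launcher:latest",
                # Extended privileges must be requested upfront.
                "securityContext": {
                    "capabilities": {"add": ["NET_RAW"]},
                },
            },
            {"name": "sidecar", "image": "example/sidecar:latest"},
        ],
    },
}

names = [c["name"] for c in pod["spec"]["containers"]]
print(names)
```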
## KubeVirt architecture

Let's now discuss how KubeVirt is structured.

![KubeVirt architecture][Components-Kubevirt]

The main components are:

* `virt-launcher`, a copy of which runs inside each pod alongside QEMU
  and libvirt, is the unprivileged component responsible for receiving
  commands from other KubeVirt components and reporting back events
  such as VM crashes;
* `virt-handler` runs at the node level via a DaemonSet, and is the
  privileged component which takes care of the VM setup by reaching
  into the corresponding pod and modifying its namespaces;
* `virt-controller` runs at the cluster level and monitors the API
  server so that it can react to user requests and VM events;
* `virt-api`, also running at the cluster level, exposes a few
  additional APIs that only apply to VMs, such as the "console" and
  "vnc" actions.

When a KubeVirt VM is started:

* We request a pod with certain privileges and resources from
  Kubernetes.
* The kubelet (the node daemon of Kubernetes) prepares the environment
  with the help of a container runtime.
* A shim process (virt-launcher) is our main entry point in the pod,
  which starts libvirt.
* Once our node daemon (virt-handler) can reach our shim process, it
  performs privileged setup from the outside: it reaches into the
  namespaces and modifies their content as needed. We mostly have to
  modify the mount and network namespaces.
* Once the environment is prepared, virt-handler asks virt-launcher to
  start a VM via its libvirt component.

More information can be found in the [KubeVirt architecture][] page.

## Comparison

The two architectures are quite similar from the high-level point of
view: in both cases there are a number of privileged components which
take care of preparing an environment suitable for running an
unprivileged QEMU process in.
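The startup sequence above can be sketched, purely as an ordering aid,
with stand-in Python functions (none of these names exist in KubeVirt,
which is a set of Go programs talking over Kubernetes APIs and sockets):

```python
# Sketch of the KubeVirt VM startup flow described above.
# Every function is a stand-in, not a real KubeVirt API.
steps = []

def request_pod():          # virt-controller asks Kubernetes for a pod
    steps.append("pod-requested")

def prepare_environment():  # kubelet + container runtime set it up
    steps.append("environment-prepared")

def start_virt_launcher():  # shim process starts libvirt inside the pod
    steps.append("virt-launcher-started")

def privileged_setup():     # virt-handler adjusts mount/net namespaces
    steps.append("namespaces-adjusted")

def start_vm():             # virt-handler asks virt-launcher -> libvirt
    steps.append("vm-started")

for step in (request_pod, prepare_environment, start_virt_launcher,
             privileged_setup, start_vm):
    step()

print(steps)
```

The key point the ordering captures is that all privileged setup happens
from outside the pod, before the unprivileged VM start.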
The difference, however, is that while libvirtd takes care of all this
setup itself, in the case of KubeVirt several smaller components are
involved: some of these components are privileged just as libvirtd is,
but others are not, and some of the tasks are not even performed by
KubeVirt itself but rather delegated to the existing Kubernetes
infrastructure.

## Use of libvirtd in KubeVirt

In the traditional virtualization scenario, `libvirtd` provides a
number of useful features on top of those available with plain QEMU,
including

* support for multiple clients connecting at the same time
* management of multiple VMs through a single entry point
* remote API access

KubeVirt interacts with libvirt under certain conditions that make the
features described above irrelevant:

* there's only one client talking to libvirt: `virt-handler`
* libvirt is only asked to manage a single VM
* client and libvirt are running in the same pod, no remote libvirt
  access

[Components-Kubevirt]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Components-Kubevirt.png
[Components-Libvirt]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Components-Libvirt.png
[KubeVirt architecture]: https://github.com/kubevirt/kubevirt/blob/master/docs/architecture.md
[Look into libvirt]: https://www.slideshare.net/ben_duyujie/look-into-libvirt-osier-yang
[Pods]: https://kubernetes.io/docs/concepts/workloads/pods/
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Hotplug.md

# Hotplug

In Kubernetes, pods are defined to be immutable, so it's not possible
to perform hotplug of devices in the same way as with the traditional
virtualization stack. This limitation is a result of KubeVirt's guiding
principle of integrating with Kubernetes as much as possible and making
VMs appear the same as containers from the user's point of view.

There have been several attempts at lifting this restriction in
Kubernetes over the years, but they have all been unsuccessful so far.

## Existing hotplug support

When working with containers, changing the amount of resources
associated with a pod will result in it being destroyed and a new pod
with the updated resource allocation being created in its place.

This works fine for containers, which are designed to be clonable and
disposable, but VMs usually can't be destroyed on a whim, and running
multiple instances of one in parallel is generally not wise even when
possible.

## Possible workarounds

Until a proper hotplug API makes its way into Kubernetes, one possible
way to implement hotplug could be to perform a migration to a container
compliant with the new allocation request, and only then perform the
QEMU-level hotplug operation.
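The migration-based workaround can be sketched as an ordered plan (all
helper and resource names here are hypothetical; the real flow would go
through KubeVirt's live migration machinery):

```python
# Sketch of migration-based "hotplug": build the new resource set,
# migrate the VM into a pod that already satisfies it, then perform the
# device-level hotplug inside QEMU. All names are stand-ins.

def hotplug_via_migration(vm, current_resources, new_device):
    target = dict(current_resources)
    target["devices"] = current_resources.get("devices", []) + [new_device]
    return [
        ("create-target-pod", target),  # pod compliant with new allocation
        ("live-migrate", vm),           # move the VM into the new pod
        ("qemu-hotplug", new_device),   # device attach via libvirt/QMP
    ]

plan = hotplug_via_migration("vm-a", {"devices": ["disk0"]}, "disk1")
print([name for name, _ in plan])
```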
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Storage.md

# Storage

This document describes the known use cases and architecture options we
have for Linux virtualization storage in [KubeVirt][].

## Problem description

The main goal of KubeVirt is to leverage the storage subsystem of
Kubernetes (built around [CSI][] and [Persistent Volumes][], aka PVs),
in order to let both kinds of workload (VMs and containers) leverage
the same storage. As a consequence, KubeVirt is limited in its use of
the QEMU storage subsystem and its features. That means:

* Storage solutions should be implemented in k8s in a way that can be
  consumed by both containers and VMs.
* VMs can only consume (and provide) storage features which are
  available in the pod, through k8s APIs. For example, a VM will not
  support disk snapshots if it's attached to a storage provider that
  doesn't support them. Ditto for incremental backup, block jobs,
  encryption, etc.

## Current situation

### Storage handled outside of QEMU

In this scenario, the VM pod uses a [Persistent Volume Claim
(PVC)][Persistent Volumes] to give QEMU access to a raw storage device
or fs mount, which is provided by a [CSI][] driver. QEMU **doesn't**
handle any of the storage use cases such as thin provisioning,
snapshots, change block tracking, block jobs, etc. This is how things
work today in KubeVirt.

![Storage handled outside of QEMU][Storage-Current]

Devices and interfaces:

* PVC: block or fs
* QEMU backend: raw device or raw image
* QEMU frontend: virtio-blk
  * alternative: emulated device for wider compatibility and Windows
    installations
    * CDROM (sata)
    * disk (sata)

Pros:

* Simplicity
* Sharing the same storage model with other pods/containers

Cons:

* Limited feature set (fully offloaded to the storage provider from
  CSI).
* No VM snapshots (disk + memory)
* Limited opportunities for fine-tuning and optimizations for high
  performance.
* Hotplug is challenging, because the set of PVCs in a pod is
  immutable.
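One gap that can be bridged in this scenario without new Kubernetes APIs
is capacity growth: when the backing PVC grows, the image file can be
resized and the running QEMU notified. A minimal sketch of the commands
involved (`qemu-img resize` and the QMP `block_resize` command are real;
the path and node name are made up, and nothing is executed here):

```python
# Sketch: react to a PVC capacity change by growing the image file and
# telling the running QEMU that the device got larger.

def grow_commands(image_path, new_size_bytes, qemu_node_name):
    # Grow the image file (only needed for file-backed storage).
    qemu_img = ["qemu-img", "resize", image_path, str(new_size_bytes)]
    # Notify QEMU of the enlarged device via the QMP block_resize command.
    qmp = {
        "execute": "block_resize",
        "arguments": {"node-name": qemu_node_name,
                      "size": new_size_bytes},
    }
    return qemu_img, qmp

cmd, qmp = grow_commands("/var/run/kubevirt/disk.img", 20 << 30, "drive0")
print(cmd)
```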
Questions and comments:

* How to optimize this in QEMU?
  * Can we bypass the block layer for this use case, e.g. by having
    SPDK inside the VM pod?
  * A Rust-based storage daemon (e.g. [vhost_user_block][]) could run
    inside the VM pod alongside QEMU (bypassing the block layer).
  * We should be able to achieve high performance with local NVMe
    storage here, with multiple polling IOThreads and multi-queue.
* See [this blog post][PVC resize blog] for information about the PVC
  resize feature. To implement this for VMs we could have KubeVirt
  watch PVCs and respond to capacity changes with a corresponding call
  to resize the image file (if applicable) and to notify QEMU of the
  enlarged device.
* Features such as incremental backup (CBT) and snapshots could be
  implemented through a generic CSI backend... Device mapper? Stratis?
  (See [Other Topics](#other-topics))

## Possible alternatives

### Storage device passthrough (highest performance)

Device passthrough via PCI VFIO, SCSI, or vDPA. No storage use cases
and no CSI, as the device is passed directly to the guest.

![Storage device passthrough][Storage-Passthrough]

Devices and interfaces:

* N/A (hardware passthrough)

Pros:

* Highest possible performance (same as host)

Cons:

* No storage features anywhere outside of the guest.
* No live migration for most cases.

### File-system passthrough (virtio-fs)

File mount volumes (directories, actually) can be exposed to QEMU via
[virtio-fs][] so that VMs have access to files and directories.

![File-system passthrough (virtio-fs)][Storage-Virtiofs]

Devices and interfaces:

* PVC: file-system

Pros:

* Simplicity from the user perspective
* Flexibility
* Great for heterogeneous workloads that share data between containers
  and VMs (e.g. OpenShift pipelines)

Cons:

* Performance when compared to block device passthrough

Questions and comments:

* The feature is still quite new (the Windows driver is fresh out of
  the oven)

### QEMU storage daemon in CSI for local storage

The qemu-storage-daemon is a user-space daemon that exposes QEMU's
block layer to external users. It's similar to [SPDK][], but includes
the implementation of QEMU block layer features such as snapshots and
bitmap tracking for incremental backup (CBT). It also allows the
splitting of a single NVMe device, allowing multiple QEMU VMs to share
one NVMe disk.

In this architecture, the storage daemon runs as part of CSI (control
plane), with the data plane being either a vhost-user-blk interface for
QEMU or a fs-mount export for containers.

![QEMU storage daemon in CSI for local storage][Storage-QSD]

Devices and interfaces:

* CSI:
  * fs mount with a vhost-user-blk socket for QEMU to open
  * (OR) fs mount via NBD or FUSE with the actual file-system contents
* qemu-storage-daemon backend: NVMe local device w/ raw or qcow2
  * alternative: any driver supported by QEMU, such as file-posix.
* QEMU frontend: virtio-blk
  * alternative: any emulated device (CDROM, virtio-scsi, etc)
    * In this case QEMU itself would be consuming vhost-user-blk and
      emulating the device for the guest

Pros:

* The NVMe driver from the storage daemon can support partitioning one
  NVMe device into multiple blk devices, each shared via a
  vhost-user-blk connection.
* Rich feature set, exposing features already implemented in the QEMU
  block layer to regular pods/containers:
  * Snapshots and thin provisioning (qcow2)
  * Incremental Backup (CBT)
* Compatibility with use cases from other projects (oVirt, OpenStack,
  etc)
  * Snapshots, thin provisioning, CBT and block jobs via QEMU

Cons:

* Complexity due to cascading and splitting of components.
* Depends on the evolution of CSI APIs to provide the right use cases.
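The control-plane/data-plane split above can be made concrete with the
qemu-storage-daemon command line. This sketch only builds the argv: the
image path, node names and socket path are examples, while the
`--blockdev`/`--export` option syntax follows the daemon's documented
interface:

```python
# Build a qemu-storage-daemon command line that opens a qcow2 image and
# exports it over vhost-user-blk on a UNIX socket. Paths and node names
# are examples; the command is constructed, not run.

def qsd_argv(image, node, socket_path):
    return [
        "qemu-storage-daemon",
        # Protocol layer: the file containing the image.
        "--blockdev",
        f"driver=file,filename={image},node-name={node}-file",
        # Format layer: interpret it as qcow2.
        "--blockdev",
        f"driver=qcow2,file={node}-file,node-name={node}",
        # Export the format node to a vhost-user-blk front-end.
        "--export",
        f"type=vhost-user-blk,id=exp0,node-name={node},"
        f"addr.type=unix,addr.path={socket_path},writable=on",
    ]

argv = qsd_argv("/pv/disk.qcow2", "disk0", "/var/run/vhost-user-blk.sock")
print(" ".join(argv))
```

QEMU (or another vhost-user-blk consumer) in the VM pod would then
connect to the UNIX socket instead of opening the PVC directly.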
Questions and comments:

* Local restrictions: QEMU and qemu-storage-daemon should be running on
  the same host (for vhost-user-blk shared memory to work).
* Need to cascade CSI providers for volume management (resize,
  creation, etc)
* How to share a partitioned NVMe device (from one storage daemon) with
  multiple pods?
* See also: [kubevirt/kubevirt#3208][] (similar idea for
  vhost-user-net).
* We could do hotplugging under the hood with the storage daemon.
  * To expose a new PV, a new qemu-storage-daemon pod can be created
    with a corresponding PVC. Conversely, on unplug, the pod can be
    deleted. Ideally, we might have a 1:1 relationship between PVs and
    storage daemon pods (though 1:n for attaching multiple guests to a
    single daemon).
  * This requires that we can create a new unix socket connection from
    new storage daemon pods to the VMs. The exact way to achieve this
    is still to be figured out. According to Adam Litke, the naive way
    would require elevated privileges for both pods.
  * After having the socket (either the file or a file descriptor)
    available in the VM pod, QEMU can connect to it.
* In order to avoid a mix of block devices having a PVC in the VM pod
  and others where we just passed the unix socket, we can completely
  avoid the PVC case for the VM pod:
  * For exposing a PV to QEMU, we would always go through the storage
    daemon (i.e. the PVC moves from the VM pod to the storage daemon
    pod), so the VM pod always only gets a unix socket connection,
    unifying the two cases.
  * Using vhost-user-blk from the storage daemon pod performs the same
    (or potentially better, if this allows for polling that we wouldn't
    have done otherwise) as having a PVC directly in the VM pod, so
    while it looks like an indirection, the actual I/O path would be
    comparable.
  * This architecture would also allow using the native
    Gluster/Ceph/NBD/... block drivers in the QEMU process without
    making them special (because they wouldn't use a PVC either),
    unifying even more cases.
* Kubernetes has fairly low per-node Pod limits by default, so we may
  need to be careful about 1:1 Pod/PVC mapping. We may want to support
  aggregation of multiple storage connections into a single q-s-d Pod.

## Other topics

### Device Mapper

Another possibility is to leverage the Linux device mapper to provide
features such as snapshots and even Incremental Backup. For example,
[dm-era][] seems to provide the basic primitives for bitmap tracking.
This could be part of scenario number 1, or cascaded with other PVs
somewhere else.

Is this already being used? For example, [cybozu-go/topolvm][] is a CSI
LVM plugin for k8s.

### Stratis

[Stratis][] seems to be an interesting project to be leveraged in the
world of Kubernetes.

### vhost-user-blk in other CSI backends

Would it make sense for other CSI backends to implement support for
vhost-user-blk?

[CSI]: https://kubernetes.io/blog/2019/01/15/container-storage-interface-ga/
[KubeVirt]: https://kubevirt.io/
[PVC resize blog]: https://kubernetes.io/blog/2018/07/12/resizing-persistent-volumes-using-kubernetes/
[Persistent Volumes]: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
[SPDK]: https://spdk.io/
[Storage-Current]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage-Current.png
[Storage-Passthrough]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage-Passthrough.png
[Storage-QSD]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage-QSD.png
[Storage-Virtiofs]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage-Virtiofs.png
[Stratis]: https://stratis-storage.github.io/
[cybozu-go/topolvm]: https://github.com/cybozu-go/topolvm
[dm-era]: https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/era.html
[kubevirt/kubevirt#3208]: https://github.com/kubevirt/kubevirt/pull/3208
[vhost_user_block]: https://github.com/cloud-hypervisor/cloud-hypervisor/tree/master/vhost_user_block
[virtio-fs]: https://virtio-fs.gitlab.io/
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md

# Networking

## Problem description

Service meshes (such as [Istio][], [Linkerd][]) typically expect application processes to run on the same physical host, usually in a separate user namespace. Network namespaces might be used too, for additional isolation. Network traffic to and from local processes is monitored and proxied by redirection and observation of local sockets. `iptables` and `nftables` (collectively referred to as the `netfilter` framework) are the typical Linux facilities providing classification and redirection of packets.

![containers][Networking-Containers]

*Service meshes with containers. Typical ingress path: **1.** NIC driver queues buffers for IP processing **2.** `netfilter` rules installed by *service mesh* redirect packets to proxy **3.** IP receive path completes, L4 protocol handler invoked **4.** TCP socket of proxy receives packets **5.** proxy opens TCP socket towards application service **6.** packets get TCP header, ready for classification **7.** `netfilter` rules installed by service mesh forward request to service **8.** local IP routing queues packets for TCP protocol handler **9.** application process receives packets and handles request. Egress path is conceptually symmetrical.*

If we move application processes to VMs, sockets and processes are not visible anymore. All the traffic is typically forwarded via interfaces operating at data link level. Socket redirection and port mapping to local processes don't work.

![and now?][Networking-Challenge]

*Application process moved to VM: **8.** IP layer enqueues packets to L2 interface towards application **9.** `tap` driver forwards L2 packets to guest **10.** packets are received on `virtio-net` ring buffer **11.** guest driver queues buffers for IP processing **12.** IP receive path completes, L4 protocol handler invoked **13.** TCP socket of application receives packets and handles request.
**:warning: Proxy challenge**: the service mesh can't forward packets to local sockets via `netfilter` rules. *Add-on* NAT rules might conflict, as service meshes expect full control of the ruleset. Socket monitoring and PID/UID classification isn't possible.*

## Existing solutions

Existing solutions typically implement a full TCP/IP stack, replaying traffic on sockets local to the Pod of the service mesh. This creates the illusion of application processes running on the same host, possibly separated by user namespaces.

![slirp][Networking-Slirp]

*Existing solutions introduce a third TCP/IP stack: **8.** local IP routing queues packets for TCP protocol handler **9.** userspace implementation of TCP/IP stack receives packets on local socket, and **10.** forwards L2 encapsulation to `tap` *QEMU* interface (socket back-end).*

While being almost transparent to the service mesh infrastructure, this kind of solution comes with a number of downsides:

* three different TCP/IP stacks (guest, adaptation and host) need to be traversed for every service request. There is no opportunity to implement zero-copy mechanisms, and the number of context switches increases dramatically
* addressing needs to be coordinated to create the pretense of consistent addresses and routes between guest and host environments.
This typically needs a NAT with masquerading, or some form of packet bridging
* the traffic seen by the service mesh and observable externally is a distant replica of the packets forwarded to and from the guest environment:
  * TCP congestion windows and network buffering mechanisms in general operate differently from what would be naturally expected by the application
  * protocols carrying addressing information might pose additional challenges, as the applications don't see the same set of addresses and routes as they would if deployed with regular containers

## Experiments

![experiments: thin layer][Networking-Experiments-Thin-Layer]

*How can we improve on the existing solutions while maintaining drop-in compatibility? A thin layer implements a TCP adaptation and IP services.*

These are some directions we have been exploring so far:

* a thinner layer between guest and host, that only implements what's strictly needed to pretend processes are running locally. A further TCP/IP stack is not necessarily needed. Some sort of TCP adaptation is needed, however, if this layer (currently implemented as a userspace process) runs without the `CAP_NET_RAW` capability: we can't create raw IP sockets on the Pod, and therefore need to map packets at layer 2 to layer 4 sockets offered by the host kernel
* to avoid implementing an actual TCP/IP stack like the one offered by *libslirp*, we can align TCP parameters advertised towards the guest (MSS, congestion window) to the socket parameters provided by the host kernel, probing them via the `TCP_INFO` socket option (introduced in Linux 2.4).
Segmentation and reassembly are therefore not needed, making it feasible to avoid dynamic memory allocation altogether, and congestion control becomes implicitly equivalent as parameters are mirrored between the two sides
* to reflect the actual receive dynamics of the guest and support retransmissions without a permanent userspace buffer, segments are not dequeued (`MSG_PEEK`) until acknowledged by the receiver (application)
* similarly, the implementation of the host-side sender adjusts MSS (`TCP_MAXSEG` socket option, since Linux 2.6.28) and advertised window (`TCP_WINDOW_CLAMP`, since Linux 2.4) to the parameters observed from incoming packets
* this adaptation layer needs to maintain some of the TCP states, but we can rely on the host kernel TCP implementation for the different states of connections being closed
* no particular requirements are placed on the MTU of guest interfaces: if fragments are received, payload from the single fragmented packets can be reassembled by the host kernel as needed, and out-of-order fragments can be safely discarded, as there's no intermediate hop justifying the condition
* this layer would connect to `qemu` over a *UNIX domain socket*, instead of a `tap` interface, so that the `CAP_NET_ADMIN` capability doesn't need to be granted to any process on the Pod: no further network interfaces are created on the host
* transparent, adaptive mapping of ports to the guest, to avoid the need for explicit port forwarding
* security and maintainability goals: no dynamic memory allocation, ~2 000 *LoC* target, no external dependencies

![experiments: ebpf][Networking-Experiments-eBPF]

*Additionally, an `eBPF` fast path could be implemented **6.** hooking at socket level, and **7.** mapping IP and Ethernet addresses, with the existing layer implementing connection tracking and slow path*

If additional capabilities are granted, the data path can be optimised in several ways:

* with `CAP_NET_RAW`:
  * the adaptation layer can use raw IP
sockets instead of L4 sockets, implementing pure connection tracking, without the need for any TCP logic: the guest operating system implements the single TCP stack needed with this variation
  * zero-copy mechanisms could be implemented using `vhost-user` and QEMU socket back-ends, instead of relying on a full-fledged layer 2 (Ethernet) interface
* with `CAP_BPF` and `CAP_NET_ADMIN`:
  * context switching in packet forwarding could be avoided by using the `sockmap` extension provided by `eBPF` and by programming the `XDP` data hooks for in-kernel data transfers
  * using eBPF programs, we might want to switch (dynamically?) to the `vhost-net` facility
  * the userspace process would still need to take care of establishing in-kernel flows, and providing IP and IPv6 services (ARP, DHCP, NDP) for addressing transparency and to avoid the need for further capabilities (e.g. `CAP_NET_BIND_SERVICE`), but the main, fast datapath would reside entirely in the kernel

[Istio]: https://istio.io/
[Linkerd]: https://linkerd.io/
[Networking-Challenge]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Challenge.png
[Networking-Containers]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Containers.png
[Networking-Experiments-Thin-Layer]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Experiments-Thin-Layer.png
[Networking-Experiments-eBPF]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Experiments-eBPF.png
[Networking-Slirp]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Slirp.png
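The socket options the experiments rely on can be exercised from unprivileged userspace quite easily. Below is a minimal, self-contained sketch (Python is used here purely for illustration; the actual adaptation layer is a C userspace process) of `TCP_INFO` probing, `MSG_PEEK`, and the `TCP_MAXSEG`/`TCP_WINDOW_CLAMP` clamps on a loopback connection, all Linux-only:

```python
import socket

# Loopback connection standing in for the host-side L4 socket.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket()
# Clamp MSS and receive window before connecting, mirroring the values
# that would be observed from the guest's packets (numbers made up).
cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG, 1400)
cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_WINDOW_CLAMP, 65535)
cli.connect(srv.getsockname())
conn, _ = srv.accept()

# Probe kernel TCP state via TCP_INFO: the first byte of struct
# tcp_info is tcpi_state (1 == TCP_ESTABLISHED on Linux).
info = cli.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 104)
state = info[0]

# MSG_PEEK: look at queued data without dequeuing it, so it stays
# available in the kernel buffer until the receiver acknowledges it.
conn.sendall(b"request")
peeked = cli.recv(16, socket.MSG_PEEK)
consumed = cli.recv(16)  # same bytes, now actually dequeued

cli.close()
conn.close()
srv.close()
```

Because the data stays queued in the kernel until the final `recv()`, a retransmission towards the guest never needs a userspace copy of the payload.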
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Live-Migration.md

# Live Migration

There are two scenarios where live migration is triggered in KubeVirt:

* On user request, by posting a `VirtualMachineInstanceMigration` to the cluster
* On cluster request, for instance on a Node eviction (due to lack of resources or maintenance of the given Node)

In both situations, KubeVirt will use libvirt to handle logic and coordination with QEMU while KubeVirt's components manage the Kubernetes control plane and the cluster's limitations. In short, KubeVirt:

* Checks if the target host is capable of running the given VM;
* Handles the single network namespace per Pod by proxying migration data (more at [Networking][]);
* Handles cluster resource usage (e.g. bandwidth usage);
* Handles cross-version migration.

![Live migration between two nodes][Live-Migration-Flow]

## Limitations

Live migration is not possible if:

* The VM is configured with cpu-passthrough;
* The VM has local or non-shared volumes;
* The Pod is using bridge binding for network access (right side of image below).

![Kubevirt's Pod][Live-Migration-Network]

## More on KubeVirt's Live migration

This blog [post on live migration][] explains how to have live migration enabled in KubeVirt's VMs and describes some of its caveats.

[Live-Migration-Flow]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Live-Migration-Flow.png
[Live-Migration-Network]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Live-Migration-Network.png
[Networking]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
[post on live migration]: https://kubevirt.io/2020/Live-migration.html
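For reference, the user-triggered flow described above boils down to posting an object along these lines (a sketch; the object and VMI names are made up):

```yaml
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachineInstanceMigration
metadata:
  name: migration-job        # hypothetical name
spec:
  vmiName: vmi-fedora        # hypothetical name of the running VMI to migrate
```

Once the object exists in the cluster, KubeVirt's controllers take over and drive libvirt through the actual migration.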
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/CPU-Pinning.md

# CPU pinning

As is the case for many of KubeVirt's features, CPU pinning is partially achieved using standard Kubernetes components: this both reduces the amount of new code that has to be written and guarantees better integration with containers running side by side with the VMs.

## Kubernetes CPU Manager

The Static policy allocates exclusive CPUs to pod containers in the Guaranteed QoS class that request an integer number of CPUs. On a best-effort basis, the Static policy tries to allocate CPUs topologically in the following order:

* Allocate all the CPUs in the same processor socket if available and the container requests at least an entire socket worth of CPUs.
* Allocate all the logical CPUs (hyperthreads) from the same physical CPU core if available and the container requests an entire core worth of CPUs.
* Allocate any available logical CPU, preferring to acquire CPUs from the same socket.

## KubeVirt dedicated CPU placement

KubeVirt relies on the Kubernetes CPU Manager to allocate dedicated CPUs to the `virt-launcher` container. When `virt-launcher` starts, it reads `/sys/fs/cgroup/cpuset/cpuset.cpus` and generates `<vcpupin>` configuration for libvirt based on the information found within. However, affinity changes require `CAP_SYS_NICE`, so this additional capability has to be granted to the VM pod.

Going forward, we would like to perform the affinity change in `virt-handler` (the privileged component running at the node level), which would allow the VM pod to work without additional capabilities.
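The cpuset-to-`<vcpupin>` step described above can be sketched as follows (hypothetical helpers, shown in Python for brevity): parse the cpuset list string read from `/sys/fs/cgroup/cpuset/cpuset.cpus` and pin each vCPU 1:1 to one of the allocated host CPUs.

```python
def parse_cpuset(cpuset: str) -> list[int]:
    """Expand a kernel cpuset list like '2-3,8' into [2, 3, 8]."""
    cpus = []
    for chunk in cpuset.split(","):
        if "-" in chunk:
            lo, hi = chunk.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(chunk))
    return cpus


def vcpupin_xml(cpuset: str) -> str:
    """Emit one <vcpupin> element per vCPU, pinned 1:1 to host CPUs."""
    return "\n".join(
        f'<vcpupin vcpu="{vcpu}" cpuset="{cpu}"/>'
        for vcpu, cpu in enumerate(parse_cpuset(cpuset))
    )
```

For example, `vcpupin_xml("2-3,8")` would pin vCPU 0 to host CPU 2, vCPU 1 to host CPU 3 and vCPU 2 to host CPU 8.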
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/NUMA-Pinning.md

# NUMA pinning

KubeVirt doesn't currently implement NUMA pinning due to Kubernetes limitations.

## Kubernetes Topology Manager

Allows aligning CPU and peripheral device allocations by NUMA node. It has many limitations:

* Not scheduler aware.
* Doesn’t allow memory alignment.
* etc...
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Isolation.md

# Isolation

How is the QEMU process isolated from the host and from other VMs?

## Traditional virtualization

cgroups

* managed by libvirt

SELinux

* libvirt is privileged and QEMU is protected by SELinux policies set by libvirt (SVirt)
* QEMU runs with SELinux type `svirt_t`

## KubeVirt

cgroups

* Managed by kubelet
* No involvement from libvirt
* Memory limits
  * When using hard limits, the entire VM can be killed by Kubernetes
  * Memory consumption estimates are based on heuristics

SELinux

* KubeVirt is not using SVirt and there are no plans to do so
* At the moment, the custom [KubeVirt SELinux policy][] is used to ensure libvirt has sufficient privilege to perform its own setup steps
* The standard SELinux type used by containers is `container_t`
  * KubeVirt would like to eventually use the same for VMs as well

Capabilities

* The default set of capabilities is fairly conservative
* Privileged operations should happen outside of the pod: in KubeVirt's case, a good candidate is `virt-handler`, the privileged component that runs at the node level
* Additional capabilities can be requested for a pod
  * However, this is frowned upon and considered a liability from the security point of view
  * The cluster admin may even set a security policy that prevents pods from using certain capabilities
  * In such a scenario, KubeVirt workloads may be entirely unable to run

## Specific examples

The following is a list of examples, either historical or current, of scenarios where libvirt's approach to isolation clashed with Kubernetes' and changes in either component were necessary.
SELinux

* libvirt use of hugetlbfs for hugepages config is disallowed by `container_t`
  * Possibly fixable by using memfd
  * [libvirt memoryBacking docs][]
  * [KubeVirt memfd issue][]
* Use of libvirt+QEMU multiqueue tap support is disallowed by `container_t`
  * And there’s no way to pass in this setup from outside the existing stack
  * [KubeVirt multiqueue workaround][] extending their SELinux policy to allow `attach_queue`
* Passing precreated tap devices to libvirt triggers relabelfrom+relabelto `tun_socket` SELinux access
  * This may not be the virt stack's fault; it seems to happen automatically when permissions aren’t correct

Capabilities

* libvirt performs memory locking for VFIO devices unconditionally
  * Previously KubeVirt had to grant `CAP_SYS_RESOURCE` to pods. KubeVirt worked around it by duplicating libvirt’s memory pinning calculations so the libvirt action would be a no-op, but that is fragile and may cause the issue to resurface if libvirt calculation logic changes.
  * References: [libvir-list memlock thread][], [KubeVirt memlock PR][], [libvirt qemuDomainGetMemLockLimitBytes][], [KubeVirt VMI.getMemlockSize][]
* virtiofsd requires the `CAP_SYS_ADMIN` capability to perform `unshare(CLONE_NEWPID|CLONE_NEWNS)`
  * This is required for certain use cases like running overlayfs in the VM on top of virtiofs, but is not a requirement for all use cases.
  * References: [KubeVirt virtiofs PR][], [RHEL virtiofs bug][]
* KubeVirt uses libvirt for CPU pinning, which requires the pod to have `CAP_SYS_NICE`.
  * Long term, KubeVirt would like to handle that pinning in their privileged component virt-handler, so `CAP_SYS_NICE` can be dropped.
  * Sidenote: libvirt unconditionally requires `CAP_SYS_NICE` when any other running VM is using CPU pinning; however, this sounds like a plain old bug.
  * References: [KubeVirt CPU pinning PR][], [KubeVirt CPU pinning workaround PR][], [RHEL CPU pinning bug][]
* libvirt bridge usage used to require `CAP_NET_ADMIN`
  * This is a historical example for reference. libvirt usage of a bridge device always implied tap device creation, which required `CAP_NET_ADMIN` privileges for the pod
  * The fix was to teach libvirt to accept a precreated tap device and skip some setup operations on it
  * Example XML: `<interface type='ethernet'><target dev='mytap0' managed='no'/></interface>`
  * KubeVirt still hasn’t fully managed to drop `CAP_NET_ADMIN` though
  * References: [RHEL precreated TAP bug][], [libvirt precreated TAP patches][], [KubeVirt precreated TAP PR][], [KubeVirt NET_ADMIN PR][], [KubeVirt NET_ADMIN issue][]

[KubeVirt CPU pinning PR]: https://github.com/kubevirt/kubevirt/pull/1381
[KubeVirt CPU pinning workaround PR]: https://github.com/kubevirt/kubevirt/pull/1648
[KubeVirt NET_ADMIN PR]: https://github.com/kubevirt/kubevirt/pull/3290
[KubeVirt NET_ADMIN issue]: https://github.com/kubevirt/kubevirt/issues/3085
[KubeVirt SELinux policy]: https://github.com/kubevirt/kubevirt/blob/master/cmd/virt-handler/virt_launcher.cil
[KubeVirt VMI.getMemlockSize]: https://github.com/kubevirt/kubevirt/blob/f5ffba5f84365155c81d0e2cda4aa709da062230/pkg/virt-handler/isolation/isolation.go#L206
[KubeVirt memfd issue]: https://github.com/kubevirt/kubevirt/issues/3781
[KubeVirt memlock PR]: https://github.com/kubevirt/kubevirt/pull/2584
[KubeVirt multiqueue workaround]: https://github.com/kubevirt/kubevirt/pull/2941/commits/bc55cb916003c54f6cbf329112a4e36d0d874836
[KubeVirt precreated TAP PR]: https://github.com/kubevirt/kubevirt/pull/2837
[KubeVirt virtiofs PR]: https://github.com/kubevirt/kubevirt/pull/3493
[RHEL CPU pinning bug]: https://bugzilla.redhat.com/show_bug.cgi?id=1819801
[RHEL precreated TAP bug]: https://bugzilla.redhat.com/show_bug.cgi?id=1723367
[RHEL virtiofs bug]: https://bugzilla.redhat.com/show_bug.cgi?id=1854595
[libvir-list memlock thread]: https://www.redhat.com/archives/libvirt-users/2019-August/msg00046.html
[libvirt memoryBacking docs]: https://libvirt.org/formatdomain.html#elementsMemoryBacking
[libvirt precreated TAP patches]: https://www.redhat.com/archives/libvir-list/2019-August/msg01256.html
[libvirt qemuDomainGetMemLockLimitBytes]: https://gitlab.com/libvirt/libvirt/-/blob/84bb5fd1ab2bce88e508d416f4bcea520c803ea8/src/qemu/qemu_domain.c#L8712
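To make the "additional capabilities can be requested for a pod" mechanism mentioned above concrete, this is roughly what it looks like in a pod definition (a sketch; the pod name and image are made up, and `SYS_NICE` is used as the example because of the CPU pinning case discussed earlier):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: virt-launcher-example       # hypothetical pod name
spec:
  containers:
  - name: compute
    image: example/virt-launcher    # hypothetical image
    securityContext:
      capabilities:
        add: ["SYS_NICE"]           # e.g. needed for CPU pinning via libvirt
```

This is exactly the kind of request a cluster-wide security policy may reject, which is why moving privileged operations into `virt-handler` is the preferred direction.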
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Upgrades.md

# Upgrades

The KubeVirt installation and upgrade processes are entirely controlled by an [operator][], which is a common pattern in the Kubernetes world. The operator is a piece of software running in the cluster and managing the lifecycle of other components, in this case KubeVirt.

## The operator

What it does:

* Manages the whole KubeVirt installation
* Keeps the cluster actively in sync with the desired state
* Upgrades KubeVirt with zero downtime

## Installation

Install the operator:

```bash
$ LATEST=$(curl -L https://storage.googleapis.com/kubevirt-prow/devel/nightly/release/kubevirt/kubevirt/latest)
$ kubectl apply -f https://storage.googleapis.com/kubevirt-prow/devel/nightly/release/kubevirt/kubevirt/${LATEST}/kubevirt-operator.yaml
$ kubectl get pods -n kubevirt
NAME                             READY   STATUS    RESTARTS   AGE
virt-operator-58cf9d6648-c7qph   1/1     Running   0          69s
virt-operator-58cf9d6648-pvzw2   1/1     Running   0          69s
```

Trigger the installation of KubeVirt:

```bash
$ LATEST=$(curl -L https://storage.googleapis.com/kubevirt-prow/devel/nightly/release/kubevirt/kubevirt/latest)
$ kubectl apply -f https://storage.googleapis.com/kubevirt-prow/devel/nightly/release/kubevirt/kubevirt/${LATEST}/kubevirt-cr.yaml
$ kubectl get pods -n kubevirt
NAME                               READY   STATUS    RESTARTS   AGE
virt-api-8bdd88557-fllhr           1/1     Running   0          59s
virt-controller-55ccb8cdcb-5rtp6   1/1     Running   0          43s
virt-controller-55ccb8cdcb-v8kr9   1/1     Running   0          43s
virt-handler-67pns                 1/1     Running   0          43s
```

The process happens in two steps because the operator relies on the KubeVirt [custom resource][] for information on the desired installation, and will not do anything until that resource exists in the cluster.
## Upgrade

The upgrade process is similar:

* Install the latest operator
* Reference the new version in the KubeVirt CustomResource to trigger the actual upgrade

```bash
$ kubectl.sh get kubevirt -n kubevirt kubevirt -o yaml
apiVersion: kubevirt.io/v1alpha3
kind: KubeVirt
metadata:
  name: kubevirt
spec:
  imageTag: v0.30
  certificateRotateStrategy: {}
  configuration: {}
  imagePullPolicy: IfNotPresent
```

Note the `imageTag` attribute: when present, the KubeVirt operator will take steps to ensure that the version of KubeVirt that's deployed on the cluster matches its value, which in our case will trigger an upgrade.

The following chart explains the upgrade flow in more detail and shows how the various components are affected:

![KubeVirt upgrade flow][Upgrades-Kubevirt]

KubeVirt is released as a complete suite: no individual `virt-launcher` releases are planned. Everything is tested together, everything is released together.

## QEMU and libvirt

The versions of QEMU and libvirt used for VMs are also tied to the version of KubeVirt and are upgraded along with everything else.

* Migrations from old libvirt/QEMU to new libvirt/QEMU pairs are possible
* As soon as the new `virt-handler` and the new controller are rolled out, the cluster will only start VMIs with the new QEMU/libvirt versions

## Version compatibility

The virt stack is updated along with KubeVirt, which mitigates compatibility concerns. As a rule of thumb, versions of QEMU and libvirt older than a year or so are not taken into consideration.

Currently, the ability to perform backward migration (e.g. from a newer version of QEMU to an older one) is not necessary, but that could very well change as KubeVirt becomes more widely used.
[Upgrades-Kubevirt]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Upgrades-Kubevirt.png
[custom resource]: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
[operator]: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Backpropagation.md

# Backpropagation

Whenever a partial VM configuration is submitted to libvirt, any missing information is automatically filled in to obtain a configuration that's complete enough to guarantee long-term guest ABI stability. PCI addresses are perhaps the most prominent example of this: most management applications don't include this information at all in the XML they submit to libvirt, and rely on libvirt building a reasonable PCI topology to support the requested devices.

For example, using a made-up YAML syntax for brevity, the input could look like

```yaml
devices:
  disks:
    - image: /path/to/image.qcow2
```

and the output could be augmented by libvirt to look like

```yaml
devices:
  controllers:
    - model: pcie-root-port
      address:
        type: pci
        domain: 0x0000
        bus: 0x00
        slot: 0x01
        function: 0x0
  disks:
    - image: /path/to/image.qcow2
      model: virtio-blk
      address:
        type: pci
        domain: 0x0000
        bus: 0x01
        slot: 0x00
        function: 0x0
```

This is where backpropagation comes in: the only version of the VM configuration that is complete enough to guarantee a stable guest ABI is the one that includes all information added by libvirt, so if the management application wants to be able to make further changes to the VM it needs to reflect the additional information back into its understanding of the VM configuration somehow.

For applications like virsh and virt-manager, this is easy: they don't have their own configuration format or even store the VM configuration, and simply fetch it from libvirt and operate on it directly every single time.
oVirt, to the best of my knowledge, generates an initial VM configuration based on the settings provided by the user, submits it to libvirt and then parses back the augmented version, figuring out what information was added and updating its database to match: if the VM configuration needs to be generated again later, it will include all information present in the database, including those that originated from libvirt rather than the user.

KubeVirt does not currently perform any backpropagation. There are two ways a user can influence PCI address allocation:

* explicitly add a `pciAddress` attribute for the device, which will cause KubeVirt to pass the corresponding address to libvirt, which in turn will attempt to comply with the user's request;
* add the `kubevirt.io/placePCIDevicesOnRootComplex` annotation to the VM configuration, which will cause KubeVirt to provide libvirt with a fully-specified PCI topology where all devices live on the PCIe Root Bus.

In all cases but the one where KubeVirt defines the full PCI topology itself, it's implicitly relying on libvirt always building the PCI topology in the exact same way every single time in order to have a stable guest ABI. While this works in practice, it's not something that libvirt actually guarantees: once a VM has been defined, libvirt will never change its PCI topology, but submitting the same partial VM configuration to different libvirt versions can result in different PCI topologies.
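The oVirt-style approach described above can be sketched in a few lines, continuing the made-up YAML syntax from earlier (represented here as Python dicts; the helper is entirely hypothetical): copy the addresses and controllers that libvirt assigned back into the management application's own record of the VM, so that future submissions reproduce the same PCI topology.

```python
def backpropagate_addresses(stored: dict, augmented: dict) -> dict:
    """Copy libvirt-assigned PCI addresses into the stored config."""
    # Index the augmented disks by image path so they can be matched
    # against the disks the application knows about.
    assigned = {
        disk["image"]: disk.get("address")
        for disk in augmented["devices"].get("disks", [])
    }
    for disk in stored["devices"].get("disks", []):
        addr = assigned.get(disk["image"])
        if addr and "address" not in disk:
            disk["address"] = addr  # remember it for future submissions
    # Controllers added by libvirt must be kept too, otherwise the PCI
    # topology (and thus the guest ABI) could change on redefinition.
    stored["devices"]["controllers"] = augmented["devices"].get("controllers", [])
    return stored
```

A sketch like this glosses over the hard parts (matching devices without a convenient unique key, merging user edits made in the meantime), which is precisely where backpropagation gets painful in practice.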
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Contacts.md

# Contacts and credits

# Contacts

The following people have agreed to serve as points of contact for follow-up discussion around the topics included in these documents.

## Overall

* Andrea Bolognani <<abologna@redhat.com>> (KVM user space)
* Cole Robinson <<crobinso@redhat.com>> (KVM user space)
* Roman Mohr <<rmohr@redhat.com>> (KubeVirt)
* Vladik Romanovsky <<vromanso@redhat.com>> (KubeVirt)

## Networking

* Alona Paz <<alkaplan@redhat.com>> (KubeVirt)
* Stefano Brivio <<sbrivio@redhat.com>> (KVM user space)

## Storage

* Adam Litke <<alitke@redhat.com>> (KubeVirt)
* Stefan Hajnoczi <<stefanha@redhat.com>> (KVM user space)

# Credits

In addition to those listed above, the following people have also contributed to the documents or the discussion around them.

Ademar Reis, Adrian Moreno Zapata, Alice Frosi, Amnon Ilan, Ariel Adam, Christophe de Dinechin, Dan Kenigsberg, David Gilbert, Eduardo Habkost, Fabian Deutsch, Gerd Hoffmann, Jason Wang, John Snow, Kevin Wolf, Marc-André Lureau, Michael Henriksen, Michael Tsirkin, Paolo Bonzini, Peter Krempa, Petr Horacek, Richard Jones, Sergio Lopez, Steve Gordon, Victor Toso, Vivek Goyal.

If your name should be in the list above but is not, please know that was an honest mistake and not a way to downplay your contribution! Get in touch and we'll get it sorted out :)