kubevirt-and-kvm: Add documents

[RFC DOCUMENT 00/12] kubevirt-and-kvm: Add documents

Posted by Andrea Bolognani 5 years, 4 months ago

Hello there!

Several weeks ago, a group of Red Hatters working on the
virtualization stack (primarily QEMU and libvirt) started a
conversation with developers from the KubeVirt project with the goal
of better understanding and documenting the interactions between the
two.

Specifically, we were interested in integration pain points, with the
underlying ideas being that only once those issues are understood it
becomes possible to look for solutions, and that better communication
would naturally lead to improvements on both sides.

This series of documents was born out of that conversation. We're
sharing them with the QEMU and libvirt communities in the hope that
they can be a valuable resource for understanding how the projects
they're working on are consumed by higher-level tools, and what
challenges are encountered in the process.

Note that, while the documents describe a number of potential
directions for things like development of new components, that's all
just brainstorming that naturally occurred as we were learning new
things: the actual design process should, and will, happen on the
upstream lists.

Right now the documents live in their own little git repository[1],
but the expectation is that eventually they will find a suitable
long-term home. The most likely candidate right now is the main
KubeVirt repository, but if you have other locations in mind please
do speak up!

I'm also aware of the fact that this delivery mechanism is fairly
unconventional, but I thought it would be the best way to spark a
discussion around these topics with the QEMU and libvirt developers.

Last but not least, please keep in mind that the documents are a work
in progress, and polish has been applied to them unevenly: while the
information presented is, to the best of our knowledge, all accurate,
some parts are in a rougher state than others. Improvements will
hopefully come over time - and if you feel like helping out in making
that happen, it would certainly be appreciated!

Looking forward to your feedback :)


[1] https://gitlab.com/abologna/kubevirt-and-kvm
-- 
Andrea Bolognani / Red Hat / Virtualization

Re: [RFC DOCUMENT 00/12] kubevirt-and-kvm: Add documents

Posted by Philippe Mathieu-Daudé 5 years, 4 months ago

Hi Andrea,

On 9/16/20 6:44 PM, Andrea Bolognani wrote:
> Hello there!
> 
> Several weeks ago, a group of Red Hatters working on the
> virtualization stack (primarily QEMU and libvirt) started a
> conversation with developers from the KubeVirt project with the goal
> of better understanding and documenting the interactions between the
> two.
> 
> Specifically, we were interested in integration pain points, with the
> underlying ideas being that only once those issues are understood it
> becomes possible to look for solutions, and that better communication
> would naturally lead to improvements on both sides.
> 
> This series of documents was born out of that conversation. We're
> sharing them with the QEMU and libvirt communities in the hope that
> they can be a valuable resource for understanding how the projects
> they're working on are consumed by higher-level tools, and what
> challenges are encountered in the process.
> 
> Note that, while the documents describe a number of potential
> directions for things like development of new components, that's all
> just brainstorming that naturally occurred as we were learning new
> things: the actual design process should, and will, happen on the
> upstream lists.
> 
> Right now the documents live in their own little git repository[1],
> but the expectation is that eventually they will find a suitable
> long-term home. The most likely candidate right now is the main
> KubeVirt repository, but if you have other locations in mind please
> do speak up!
> 
> I'm also aware of the fact that this delivery mechanism is fairly
> unconventional, but I thought it would be the best way to spark a
> discussion around these topics with the QEMU and libvirt developers.
> 
> Last but not least, please keep in mind that the documents are a work
> in progress, and polish has been applied to them unevenly: while the
> information presented is, to the best of our knowledge, all accurate,
> some parts are in a rougher state than others. Improvements will
> hopefully come over time - and if you feel like helping out in making
> that happen, it would certainly be appreciated!
> 
> Looking forward to your feedback :)
> 
> 
> [1] https://gitlab.com/abologna/kubevirt-and-kvm

Thanks a lot for this documentation, I could learn new things,
use cases out of my interest area. Useful as a developer to
better understand how are used the areas I'm coding. This
shorten a bit that gap between developers and users.

What would be more valuable than a developer review/feedback is
having feedback from users and technical writers.
Suggestion: also share it on qemu-discuss@nongnu.org which is
less technical (maybe simply repost the cover and link to the
Wiki).

--

What is not obvious in this cover (and the documents pasted on
the list) is there are schema pictures on the Wiki pages which
are not viewable and appreciable via an email post.

--

I had zero knowledge on Kubernetes. I have been confused by their
use in the introduction...

From Index:

"The intended audience is people who are familiar with the traditional
virtualization stack (QEMU plus libvirt), and in order to make it
more approachable to them comparisons, are included and little to no
knowledge of KubeVirt or Kubernetes is assumed."

Then in Architecture's {Goals and Components} there is an assumption
Kubernetes is known. Entering in Components, Kubernetes is briefly
but enough explained.

Then KubeVirt is very well explained.

--

Sometimes the "Other topics" category is confusing, it seems out
of the scope of the "better understanding and documenting the
interactions between KubeVirt and KVM" and looks like left over
notes. I.e.:

"Another possibility is to leverage the device-mapper from Linux to
provide features such as snapshots and even like Incremental Backup.
For example, dm-era seems to provide the basic primitives for
bitmap tracking.
This could be part of scenario number 1, or cascaded with other PVs
somewhere else.
Is this already being used? For example, cybozu-go/topolvm is a
CSI LVM Plugin for k8s."

"vhost-user-blk in other CSI backends
Would it make sense for other CSI backends to implement support for
vhost-user-blk?"

"The audience is people who are familiar with the traditional
virtualization stack (QEMU plus libvirt)". Feeling part of the
audience, I have no clue how to answer these questions...
I'd prefer you tell me :)

Maybe renaming the "Other topics" section would help.
"Unanswered questions", "Other possibilities to investigate",...

--

Very good contribution in documentation,

Thanks!

Phil.

Re: [RFC DOCUMENT 00/12] kubevirt-and-kvm: Add documents

Posted by Andrea Bolognani 5 years, 4 months ago

On Tue, 2020-09-22 at 11:29 +0200, Philippe Mathieu-Daudé wrote:
> Hi Andrea,

Hi Philippe, and sorry for the delay in answering!

First of all, thanks for taking the time to go through the documents
and posting your thoughts. More comments below.

> Thanks a lot for this documentation, I could learn new things,
> use cases out of my interest area. Useful as a developer to
> better understand how are used the areas I'm coding. This
> shorten a bit that gap between developers and users.
> 
> What would be more valuable than a developer review/feedback is
> having feedback from users and technical writers.
> Suggestion: also share it on qemu-discuss@nongnu.org which is
> less technical (maybe simply repost the cover and link to the
> Wiki).

More eyes would obviously be good, but note that these are really
intended to improve the interactions between QEMU/libvirt and
KubeVirt, so the audience is ultimately developers. Of course you
could say that KubeVirt developers *are* users when it comes to
QEMU/libvirt, and you wouldn't be wrong ;) Still, qemu-devel seems
like the proper venue.

> What is not obvious in this cover (and the documents pasted on
> the list) is there are schema pictures on the Wiki pages which
> are not viewable and appreciable via an email post.

You're right! I was pretty sure I had a line about that somewhere in
there but I guess it got lost during editing. Hopefully the URL at
the very beginning of each document caused people to browse the HTML
version.

> I had zero knowledge on Kubernetes. I have been confused by their
> use in the introduction...
> 
> From Index:
> 
> "The intended audience is people who are familiar with the traditional
> virtualization stack (QEMU plus libvirt), and in order to make it
> more approachable to them comparisons, are included and little to no
> knowledge of KubeVirt or Kubernetes is assumed."
> 
> Then in Architecture's {Goals and Components} there is an assumption
> Kubernetes is known. Entering in Components, Kubernetes is briefly
> but enough explained.
> 
> Then KubeVirt is very well explained.

I guess the sections in the Index you're referring to assume that you
know that Kubernetes is somehow connected to containers, and that
it's a clustered environment. Anything else I missed?

Perhaps we could move the contents of

  https://gitlab.cee.redhat.com/abologna/kubevirt-and-kvm/-/blob/master/Components.md#kubernetes

to a small document that's linked to near the very top. Would that
improve things, in your opinion?

> Sometimes the "Other topics" category is confusing, it seems out
> of the scope of the "better understanding and documenting the
> interactions between KubeVirt and KVM" and looks like left over
> notes.

That's probably because they absolutely are O:-)

> Maybe renaming the "Other topics" section would help.
> "Unanswered questions", "Other possibilities to investigate",...

This sounds sensible :)

Thanks again for your feedback!

-- 
Andrea Bolognani / Red Hat / Virtualization

[RFC DOCUMENT 01/12] kubevirt-and-kvm: Add Index page

Posted by Andrea Bolognani 5 years, 4 months ago

https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Index.md

# KubeVirt and the KVM user space

This is the entry point to a series of documents which, together,
detail the current status of KubeVirt and how it interacts with the
KVM user space.

The intended audience is people who are familiar with the traditional
virtualization stack (QEMU plus libvirt). To make the documents more
approachable to them, comparisons are included and little to no
knowledge of KubeVirt or Kubernetes is assumed.

Each section contains a short summary as well as a link to a separate
document discussing the topic in more detail, with the intention that
readers will be able to quickly get a high-level understanding of the
various topics by reading this overview document and then dig further
into the specific areas they're interested in.

## Architecture

### Goals

* KubeVirt aims to feel completely native to Kubernetes users
  * VMs should behave like containers whenever possible
  * There should be no features that are limited to VMs when it would
    make sense to implement them for containers too
* KubeVirt also aims to support all the workloads that traditional
  virtualization can handle
  * Windows support, device assignment etc. are all fair game
* When these two goals clash, integration with Kubernetes usually
  wins

### Components

* KubeVirt is made up of various discrete components that interact
  with Kubernetes and the KVM user space
  * The overall design is somewhat similar to that of libvirt, except
    with a much higher granularity and many of the tasks offloaded to
    Kubernetes
  * Some of the components run at the cluster level or host level
    with very high privileges, others run at the pod level with
    significantly reduced privileges

Additional information: [Components][]

### Runtime environment

* QEMU expects its environment to be set up in advance, something
  that is typically taken care of by libvirt
* libvirtd, when not running in session mode, assumes that it has
  root-level access to the system and can perform pretty much any
  privileged operation
* In Kubernetes, the runtime environment is usually heavily locked
  down and many privileged operations are not permitted
  * Requiring additional permissions for VMs goes against the goal,
    mentioned earlier, to have VMs behave the same as containers
    whenever possible

## Specific areas

### Hotplug

* QEMU supports hotplug (and hot-unplug) of most devices, and its use
  is extremely common
* Conversely, resources associated with containers such as storage
  volumes, network interfaces and CPU shares are allocated upfront
  and do not change throughout the life of the workload
  * If the container needs more (or less) resources, the Kubernetes
    approach is to destroy the existing one and schedule a new one to
    take over its role

Additional information: [Hotplug][]

### Storage

* Handled through the same Kubernetes APIs used for containers
  * QEMU / libvirt only see an image file and don't have direct
    access to the underlying storage implementation
  * This makes certain scenarios that are common in the
    virtualization world very challenging: examples include hotplug
    and full VM snapshots (storage plus memory)
* It might be possible to remove some of these limitations by
  changing the way storage is exposed to QEMU, or even take advantage
  of the storage technologies that QEMU already implements and make
  them available to containers in addition to VMs.

Additional information: [Storage][]

### Networking

* Application processes running in VMs are hidden behind a network
  interface as opposed to local sockets and processes running in
  a separated user namespace
  * Service meshes proxy and monitor applications by means of
    socket redirection and classification on local ports and
    process identifiers. We need to aim for generic compatibility
  * Existing solutions replicate a full TCP/IP stack to pretend
    applications running in a QEMU instance are local. No chances
    for zero-copy and context switching avoidance
* Network connectivity is shared between control plane and workload
  itself. Addressing and port mapping need particular attention
* Linux capabilities granted to the pod might be minimal, or none
  at all. Live migration presents further challenges in terms of
  network addressing and port mapping

Additional information: [Networking][]

### Live migration

* QEMU supports live migration between hosts, usually coordinated by
  libvirt
* Kubernetes expects containers to be disposable, so the equivalent
  of live migration would be to simply destroy the ones running on
  the source node and schedule replacements on the destination node
* For KubeVirt, a hybrid approach is used: a new container is created
  on the target node, then the VM is migrated from the old container,
  running on the source node, to the newly-created one

Additional information: [Live migration][]

### CPU pinning

* CPU pinning is not handled by QEMU directly, but is instead
  delegated to libvirt
* KubeVirt figures out which CPUs are assigned to the container after
  it has been started by Kubernetes, and passes that information to
  libvirt so that it can perform CPU pinning

Additional information: [CPU pinning][]
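
As a rough illustration of the second point, the sketch below parses a CPU list in the kernel's list format (for example the contents of a cgroup `cpuset.cpus` file) and emits a libvirt `<cputune>` fragment that pins each vCPU to one of those CPUs. The function names and the one-to-one vCPU-to-CPU assignment are assumptions made for the example, not KubeVirt's actual code.

```python
import xml.etree.ElementTree as ET


def parse_cpuset(text):
    """Parse a kernel-style CPU list such as "2-4,7" into [2, 3, 4, 7]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return sorted(set(cpus))


def build_cputune(vcpu_count, host_cpus):
    """Return a <cputune> element pinning vCPU i to the i-th host CPU.

    Hypothetical helper: assumes one dedicated host CPU per vCPU.
    """
    if vcpu_count > len(host_cpus):
        raise ValueError("not enough host CPUs assigned to the container")
    cputune = ET.Element("cputune")
    for vcpu, cpu in zip(range(vcpu_count), host_cpus):
        ET.SubElement(cputune, "vcpupin", vcpu=str(vcpu), cpuset=str(cpu))
    return cputune


# The string stands in for what would be read from the container's cgroup.
assigned = parse_cpuset("2-3,8\n")
print(ET.tostring(build_cputune(2, assigned), encoding="unicode"))
```

The important property is the ordering: the CPU set is only known once Kubernetes has started the container, so the pinning has to be computed afterwards rather than declared upfront.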

### NUMA pinning

* NUMA pinning is not handled by QEMU directly, but is instead
  delegated to libvirt
* KubeVirt doesn't implement NUMA pinning at the moment

Additional information: [NUMA pinning][]

### Isolation

* For security reasons, it's a good idea to run each QEMU process in
  an environment that is isolated from the host as well as other VMs
  * This includes using a separate unprivileged user account, setting
    up namespaces and cgroups, using SELinux...
  * QEMU doesn't take care of this itself and delegates it to libvirt
* Most of these techniques serve as the base for containers, so
  KubeVirt can rely on Kubernetes providing a similar level of
  isolation without further intervention

Additional information: [Isolation][]

## Other tidbits

### Upgrades

* When libvirt is upgraded, running VMs keep using the old QEMU
  binary: the new QEMU binary is used for newly-started VMs as well
  as after VMs have been power cycled or migrated
* KubeVirt behaves similarly, with the old version of libvirt and
  QEMU remaining in use for running VMs

Additional information: [Upgrades][]

### Backpropagation

* Applications using libvirt usually don't provide all information,
  e.g. a full PCI topology, and let libvirt fill in the blanks
  * This might require a second step where the additional information
    is collected and stored along with the original one
* Backpropagation doesn't fit well in Kubernetes' declarative model,
  so KubeVirt doesn't currently perform it

Additional information: [Backpropagation][]
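
To make the "fill in the blanks" step concrete, here is a toy model in which an application supplies devices without PCI addresses, a stand-in for libvirt assigns them, and the application stores the completed definition. `fill_in_addresses` and the flat slot numbering are hypothetical; real libvirt address assignment is considerably more involved.

```python
import copy


def fill_in_addresses(devices, first_slot=1):
    """Stand-in for libvirt: give every device without an address the next free slot."""
    completed = copy.deepcopy(devices)
    used = {d["pci_slot"] for d in completed if "pci_slot" in d}
    slot = first_slot
    for dev in completed:
        if "pci_slot" in dev:
            continue
        while slot in used:
            slot += 1
        dev["pci_slot"] = slot
        used.add(slot)
    return completed


# What the application asked for: only one device has an explicit address.
requested = [{"name": "net0"}, {"name": "disk0", "pci_slot": 1}, {"name": "disk1"}]

# Backpropagation: the completed definition is stored alongside the request,
# so feeding it back in later yields exactly the same result.
stored = fill_in_addresses(requested)
print(stored)
```

In a declarative model the stored object would differ from what the user submitted, which is one way to see why this step sits awkwardly in Kubernetes.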

## Contacts and credits

This information was collected and organized by many people at Red
Hat, some of whom have agreed to serve as points of contact for
follow-up discussion.

Additional information: [Contacts][]

[Backpropagation]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Backpropagation.md
[CPU pinning]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/CPU-Pinning.md
[Components]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Components.md
[Contacts]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Contacts.md
[Hotplug]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Hotplug.md
[Isolation]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Isolation.md
[Live migration]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Live-Migration.md
[NUMA pinning]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/NUMA-Pinning.md
[Networking]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
[Storage]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Storage.md
[Upgrades]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Upgrades.md

[RFC DOCUMENT 02/12] kubevirt-and-kvm: Add Components page

Posted by Andrea Bolognani 5 years, 4 months ago

https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Components.md

# Components

This document describes the various components of the KubeVirt
architecture, how they fit together, and how they compare to the
traditional virtualization architecture (QEMU + libvirt).

## Traditional architecture

For the comparison to make sense, let's start by reviewing the
architecture used for traditional virtualization.

![libvirt architecture][Components-Libvirt]

(Image taken from the "[Look into libvirt][]" presentation by Osier
Yang, which is a bit old but still mostly accurate from a high-level
perspective.)

In particular, the `libvirtd` process runs with high privileges on
the host and is responsible for managing all VMs.

When asked to start a VM, the management process will

* Prepare the environment by performing a number of privileged
  operations upfront
  * Set up CGroups
  * Set up kernel namespaces
  * Apply SELinux labels
  * Configure network devices
  * Open host files
  * ...
* Start a non-privileged QEMU process in that environment

## Kubernetes

To understand how KubeVirt works, it's first necessary to have some
knowledge of Kubernetes.

In Kubernetes, every user workload runs inside [Pods][]. The pod is
the smallest unit of work that Kubernetes will schedule.

Some facts about pods:

* They consist of one or more containers
* The containers share a network namespace
* The containers have their own PID and mount namespace
* The containers have their own CGroups for CPU, memory, devices and
  so forth. They are controlled by k8s and can’t be modified from
  outside.
* Pods can be started with extended privileges (`CAP_SYS_NICE`,
  `CAP_NET_RAW`, root user, ...)
* The app in the pod can drop the privileges, but the pod itself
  cannot drop them (`kubectl exec` gives you a shell with the full
  privileges).

Creating pods with elevated privileges is generally frowned upon, and
depending on the policy decided by the cluster administrator it might
be outright impossible.
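
For readers who have never seen one, the following sketch builds a minimal pod definition as a Python dictionary, with two containers of which one asks for an extra capability. The pod name, container names and images are placeholders. Note that Kubernetes spells capability names without the `CAP_` prefix.

```python
import json

# Hypothetical pod: the names and images are placeholders.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "example"},
    "spec": {
        "containers": [
            {
                "name": "app",
                "image": "example.org/app:latest",
                # Extended privileges are requested per container.
                "securityContext": {"capabilities": {"add": ["NET_RAW"]}},
            },
            {"name": "sidecar", "image": "example.org/sidecar:latest"},
        ]
    },
}


def added_capabilities(pod):
    """Map each container name to the capabilities it asks to add."""
    result = {}
    for c in pod["spec"]["containers"]:
        caps = c.get("securityContext", {}).get("capabilities", {}).get("add", [])
        result[c["name"]] = list(caps)
    return result


print(json.dumps(added_capabilities(pod)))
```

A cluster policy that forbids elevated privileges would reject a pod like this one because of the `add` list, which is the situation the paragraph above warns about.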

## KubeVirt architecture

Let's now discuss how KubeVirt is structured.

![KubeVirt architecture][Components-Kubevirt]

The main components are:

* `virt-launcher`, a copy of which runs inside each pod alongside QEMU
  and libvirt, is the unprivileged component responsible for
  receiving commands from other KubeVirt components and reporting
  back events such as VM crashes;
* `virt-handler` runs at the node level via a DaemonSet, and is the
  privileged component which takes care of the VM setup by reaching
  into the corresponding pod and modifying its namespaces;
* `virt-controller` runs at the cluster level and monitors the API
  server so that it can react to user requests and VM events;
* `virt-api`, also running at the cluster level, exposes a few
  additional APIs that only apply to VMs, such as the "console" and
  "vnc" actions.

When a KubeVirt VM is started:

* We request a Pod with certain privileges and resources from
  Kubernetes.
* The kubelet (the node daemon of Kubernetes) prepares the
  environment with the help of a container runtime.
* A shim process (virt-launcher) is our main entrypoint in the pod,
  which starts libvirt.
* Once our node-daemon (virt-handler) can reach our shim process, it
  does privileged setup from outside. It reaches into the namespaces
  and modifies their content as needed. We mostly have to modify the
  mount and network namespaces.
* Once the environment is prepared, virt-handler asks virt-launcher
  to start a VM via its libvirt component.
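
The sequence above can be modelled as data to make the split between privileged and unprivileged actors explicit. This is only a sketch: the `Step` type is invented for the example, and marking the first step as unprivileged and the kubelet step as privileged is an assumption rather than something stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    actor: str
    privileged: bool
    action: str


# Ordering and actors follow the list above; the privileged flags for the
# first two steps are an assumption of this sketch.
STARTUP = [
    Step("KubeVirt", False, "request a pod from Kubernetes"),
    Step("kubelet", True, "prepare the environment via the container runtime"),
    Step("virt-launcher", False, "start libvirt inside the pod"),
    Step("virt-handler", True, "adjust the pod's mount and network namespaces"),
    Step("virt-launcher", False, "start the VM through libvirt"),
]


def privileged_actors(steps):
    """Return the actors that need elevated privileges, in order of appearance."""
    seen = []
    for step in steps:
        if step.privileged and step.actor not in seen:
            seen.append(step.actor)
    return seen


print(privileged_actors(STARTUP))
```

Everything that runs inside the pod stays unprivileged; the privileged work is done from outside, before the VM is started.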

More information can be found in the [KubeVirt architecture][] page.

## Comparison

The two architectures are quite similar from the high-level point of
view: in both cases there are a number of privileged components which
take care of preparing an environment suitable for running an
unprivileged QEMU process in.

The difference, however, is that while libvirtd takes care of all
this setup itself, in the case of KubeVirt several smaller components
are involved: some of these components are privileged just as libvirtd
is, but others are not, and some of the tasks are not even performed
by KubeVirt itself but rather delegated to the existing Kubernetes
infrastructure.

## Use of libvirtd in KubeVirt

In the traditional virtualization scenario, `libvirtd` provides a
number of useful features on top of those available with plain QEMU,
including

* support for multiple clients connecting at the same time
* management of multiple VMs through a single entry point
* remote API access

KubeVirt interacts with libvirt under certain conditions that make
the features described above irrelevant:

* there's only one client talking to libvirt: `virt-launcher`
* libvirt is only asked to manage a single VM
* client and libvirt are running in the same pod, no remote libvirt
  access

[Components-Kubevirt]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Components-Kubevirt.png
[Components-Libvirt]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Components-Libvirt.png
[KubeVirt architecture]: https://github.com/kubevirt/kubevirt/blob/master/docs/architecture.md
[Look into libvirt]: https://www.slideshare.net/ben_duyujie/look-into-libvirt-osier-yang
[Pods]: https://kubernetes.io/docs/concepts/workloads/pods/

[RFC DOCUMENT 03/12] kubevirt-and-kvm: Add Hotplug page

Posted by Andrea Bolognani 5 years, 4 months ago

https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Hotplug.md

# Hotplug

In Kubernetes, pods are defined to be immutable, so it's not possible
to perform hotplug of devices in the same way as with the traditional
virtualization stack.

This limitation is a result of KubeVirt's guiding principle of
integrating with Kubernetes as much as possible and making VMs appear
the same as containers from the user's point of view.

There have been several attempts at lifting this restriction in
Kubernetes over the years, but they have all been unsuccessful so
far.

## Existing hotplug support

When working with containers, changing the amount of resources
associated with a pod will result in it being destroyed and a new
pod with the updated resource allocation being created in its place.

This works fine for containers, which are designed to be clonable and
disposable, but when it comes to VMs they usually can't be destroyed
on a whim and running multiple instances in parallel is generally not
wise even when possible.

## Possible workarounds

Until a proper hotplug API makes its way into Kubernetes, one
possible way to implement hotplug could be to perform migration to a
container compliant with the new allocation request, and only then
perform the QEMU-level hotplug operation.
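
The ordering of that workaround can be sketched with a toy model in which a pod's allocation is fixed at creation time. The `Pod` and `VM` classes and the `hotplug_via_migration` function are hypothetical, and each step stands in for a much larger operation.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    """Toy pod: its resource allocation is fixed at creation time."""
    node: str
    volumes: frozenset


@dataclass
class VM:
    pod: Pod
    attached: set = field(default_factory=set)


def hotplug_via_migration(vm, new_volume, target_node):
    """Migrate the VM to a pod that already has the volume, then hotplug it."""
    # 1. Ask for a new pod whose allocation includes the extra volume.
    new_pod = Pod(node=target_node, volumes=vm.pod.volumes | {new_volume})
    # 2. Live-migrate the VM into the new pod (the old pod is then discarded).
    vm.pod = new_pod
    # 3. Only now perform the QEMU-level hotplug inside the new pod.
    if new_volume not in vm.pod.volumes:
        raise RuntimeError("volume is not available in the pod")
    vm.attached.add(new_volume)
    return vm


vm = VM(pod=Pod(node="node-a", volumes=frozenset({"rootdisk"})), attached={"rootdisk"})
hotplug_via_migration(vm, "datadisk", target_node="node-b")
print(vm.pod.node, sorted(vm.attached))
```

No existing pod is ever modified in this flow, which is what keeps it compatible with the immutability rule described at the top of this page.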

[RFC DOCUMENT 04/12] kubevirt-and-kvm: Add Storage page

Posted by Andrea Bolognani 5 years, 4 months ago

https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Storage.md

# Storage

This document describes the known use-cases and architecture options
we have for Linux Virtualization storage in [KubeVirt][].

## Problem description

The main goal of KubeVirt is to leverage the storage subsystem of
Kubernetes (built around [CSI][] and [Persistent Volumes][] aka PVs),
in order to let both workloads (VMs and containers) leverage the same
storage. As a consequence, KubeVirt is limited in its use of QEMU's
storage subsystem and features. That means:

* Storage solutions should be implemented in k8s in a way that can be
  consumed by both containers and VMs.
* VMs can only consume (and provide) storage features which are
  available in the pod, through k8s APIs. For example, a VM will not
  support disk snapshots if it’s attached to a storage provider that
  doesn’t support them. Ditto for incremental backup, block jobs,
  encryption, etc.

## Current situation

### Storage handled outside of QEMU

In this scenario, the VM pod uses a [Persistent Volume Claim
(PVC)][Persistent Volumes] to give QEMU access to a raw storage
device or fs mount, which is provided by a [CSI][] driver. QEMU
**doesn’t** handle any of the storage use-cases such as thin
provisioning, snapshots, change block tracking, block jobs, etc.

This is how things work today in KubeVirt.

![Storage handled outside of QEMU][Storage-Current]

Devices and interfaces:

* PVC: block or fs
* QEMU backend: raw device or raw image
* QEMU frontend: virtio-blk
  * alternative: emulated device for wider compatibility and Windows
    installations
  * CDROM (sata)
  * disk (sata)
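
For readers coming from libvirt, the combinations listed above correspond roughly to `<disk>` elements like the ones generated below. This is a sketch: the `disk_xml` helper and the paths are made up for the example, and the XML KubeVirt actually generates may differ.

```python
import xml.etree.ElementTree as ET


def disk_xml(source, target_dev, bus="virtio", is_block=True):
    """Build a libvirt <disk> element for a raw block device or raw image file."""
    disk = ET.Element("disk", type="block" if is_block else "file", device="disk")
    ET.SubElement(disk, "driver", name="qemu", type="raw")
    # Block devices are referenced with dev=, image files with file=.
    ET.SubElement(disk, "source", {"dev" if is_block else "file": source})
    ET.SubElement(disk, "target", dev=target_dev, bus=bus)
    return disk


# Block-mode PVC exposed through virtio-blk (placeholder device path).
block = disk_xml("/dev/example-pvc", "vda")
# Filesystem-mode PVC holding a raw image, exposed as an emulated SATA disk.
image = disk_xml("/var/run/example/disk.img", "sda", bus="sata", is_block=False)

print(ET.tostring(block, encoding="unicode"))
print(ET.tostring(image, encoding="unicode"))
```

In both cases the driver type is `raw`, reflecting the fact that QEMU's own image format features are not used in this scenario.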

Pros:

* Simplicity
* Sharing the same storage model with other pods/containers

Cons:

* Limited feature-set (fully off-loaded to the storage provider from
  CSI).
* No VM snapshots (disk + memory)
* Limited opportunities for fine-tuning and optimizations for
  high-performance.
* Hotplug is challenging, because the set of PVCs in a pod is
  immutable.

Questions and comments:

* How to optimize this in QEMU?
  * Can we bypass the block layer for this use-case? Like having SPDK
    inside the VM pod?
    * Rust-based storage daemon (e.g. [vhost_user_block][]) running
      inside the VM pod alongside QEMU (bypassing the block layer)
  * We should be able to achieve high performance with local NVMe
    storage here, with multiple polling IOThreads and multi-queue.
* See [this blog post][PVC resize blog] for information about the PVC
  resize feature. To implement this for VMs we could have KubeVirt
  watch PVCs and respond to capacity changes with a corresponding
  call to resize the image file (if applicable) and to notify QEMU of
  the enlarged device.
* Features such as incremental backup (CBT) and snapshots could be
  implemented through a generic CSI backend... Device mapper?
  Stratis? (See [Other Topics](#other-topics))
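
The PVC resize idea mentioned above could look roughly like this on the QEMU side: compare the old and new PVC capacity and, if the volume grew, produce a QMP `block_resize` command. The helper names and the `disk0` node name are assumptions, and the quantity parser only handles plain integers with the common suffixes.

```python
import json

# Binary and decimal suffixes used by Kubernetes resource quantities.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def quantity_to_bytes(quantity):
    """Convert a quantity such as "10Gi" to bytes (integers only, for brevity)."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def resize_command(node_name, old_capacity, new_capacity):
    """Return a QMP block_resize command if the PVC grew, else None."""
    new_size = quantity_to_bytes(new_capacity)
    if new_size <= quantity_to_bytes(old_capacity):
        return None
    return {
        "execute": "block_resize",
        "arguments": {"node-name": node_name, "size": new_size},
    }


# "disk0" is a placeholder for the block node backing the guest disk.
print(json.dumps(resize_command("disk0", "10Gi", "20Gi")))
```

When libvirt sits between KubeVirt and QEMU, as it does today, the same request would more likely go through libvirt's block resize API than through raw QMP.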

## Possible alternatives

### Storage device passthrough (highest performance)

Device passthrough via PCI VFIO, SCSI, or vDPA. No storage use-cases
and no CSI, as the device is passed directly to the guest.

![Storage device passthrough][Storage-Passthrough]

Devices and interfaces:

* N/A (hardware passthrough)

Pros:

* Highest possible performance (same as host)

Cons:

* No storage features anywhere outside of the guest.
* No live-migration for most cases.

### File-system passthrough (virtio-fs)

File mount volumes (directories, actually) can be exposed to QEMU via
[virtio-fs][] so that VMs have access to files and directories.

![File-system passthrough (virtio-fs)][Storage-Virtiofs]

Devices and interfaces:

* PVC: file-system
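
In libvirt terms, exposing such a directory comes down to a `<filesystem>` element using the `virtiofs` driver. The sketch below generates one; the helper name, host path and mount tag are placeholders, and the shared guest memory configuration that libvirt additionally requires for virtiofs is left out.

```python
import xml.etree.ElementTree as ET


def virtiofs_xml(host_dir, mount_tag):
    """Build a libvirt <filesystem> element that shares host_dir via virtio-fs."""
    fs = ET.Element("filesystem", type="mount", accessmode="passthrough")
    ET.SubElement(fs, "driver", type="virtiofs")
    ET.SubElement(fs, "source", dir=host_dir)
    # For virtiofs the target "dir" is a tag chosen by the host, not a path.
    ET.SubElement(fs, "target", dir=mount_tag)
    return fs


share = virtiofs_xml("/var/run/example/pvc-mount", "shared")
print(ET.tostring(share, encoding="unicode"))
```

Inside a Linux guest the share would then be mounted with `mount -t virtiofs shared <mountpoint>`, using the same tag.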

Pros:

* Simplicity from the user-perspective
* Flexibility
* Great for heterogeneous workloads that share data between
  containers and VMs (e.g. OpenShift pipelines)

Cons:

* Lower performance when compared to block device passthrough

Questions and comments:

* Feature is still quite new (The Windows driver is fresh out of the
  oven)

### QEMU storage daemon in CSI for local storage

The qemu-storage-daemon is a user-space daemon that exposes QEMU’s
block layer to external users. It’s similar to [SPDK][], but includes
the implementation of QEMU block layer features such as snapshots and
bitmap tracking for incremental backup (CBT). It also allows the
splitting of one single NVMe device, allowing multiple QEMU VMs to
share one NVMe disk.

In this architecture, the storage daemon runs as part of CSI (control
plane), with the data-plane being either a vhost-user-blk interface
for QEMU or a fs-mount export for containers.

![QEMU storage daemon in CSI for local storage][Storage-QSD]

Devices and interfaces:

* CSI:
  * fs mount with a vhost-user-blk socket for QEMU to open
  * (OR) fs mount via NBD or FUSE with the actual file-system
    contents
* qemu-storage-daemon backend: NVMe local device w/ raw or qcow2
  * alternative: any driver supported by QEMU, such as file-posix.
* QEMU frontend: virtio-blk
  * alternative: any emulated device (CDROM, virtio-scsi, etc)
  * In this case QEMU itself would be consuming vhost-user-blk and
    emulating the device for the guest
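
As a sketch of the daemon side of this setup, the function below assembles a `qemu-storage-daemon` command line that opens one image and exports it over a vhost-user-blk UNIX socket. It assumes a QEMU version whose storage daemon supports the `vhost-user-blk` export type, uses a plain file rather than an NVMe device for simplicity, and all paths and node names are placeholders.

```python
import shlex


def storage_daemon_argv(image_path, socket_path, node_name="disk0", image_format="qcow2"):
    """Build a qemu-storage-daemon command line exporting one image over vhost-user-blk."""
    return [
        "qemu-storage-daemon",
        # Protocol layer: open the host file.
        "--blockdev",
        f"driver=file,node-name={node_name}-file,filename={image_path}",
        # Format layer: interpret the file as qcow2 (or raw).
        "--blockdev",
        f"driver={image_format},node-name={node_name},file={node_name}-file",
        # Export the format node on a UNIX socket for QEMU to connect to.
        "--export",
        f"type=vhost-user-blk,id={node_name}-export,node-name={node_name},"
        f"addr.type=unix,addr.path={socket_path},writable=on",
    ]


argv = storage_daemon_argv("/images/example.qcow2", "/run/example/vhost-user-blk.sock")
print(shlex.join(argv))
```

The socket path is the only thing the VM pod needs to see, which is why the CSI side is described above as an fs mount containing that socket.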

Pros:

* The NVMe driver from the storage daemon can support partitioning
  one NVMe device into multiple blk devices, each shared via a
  vhost-user-blk connection.
* Rich feature set, exposing features already implemented in the QEMU
  block layer to regular pods/containers:
  * Snapshots and thin-provisioning (qcow2)
  * Incremental Backup (CBT)
* Compatibility with use-cases from other projects (oVirt, OpenStack,
  etc)
  * Snapshots, thin-provisioning, CBT and block jobs via QEMU

Cons:

* Complexity due to cascading and splitting of components.
* Depends on the evolution of CSI APIs to provide the right
  use-cases.

Questions and comments:

* Local restrictions: QEMU and qemu-storage-daemon should be running
  on the same host (for vhost-user-blk shared memory to work).
* Need to cascade CSI providers for volume management (resize,
  creation, etc)
* How to share a partitioned NVMe device (from one storage daemon)
  with multiple pods?
* See also: [kubevirt/kubevirt#3208][] (similar idea for
  vhost-user-net).
* We could do hotplugging under the hood with the storage daemon.
  * To expose a new PV, a new qemu-storage-daemon pod can be created
    with a corresponding PVC. Conversely, on unplug, the pod can be
    deleted. Ideally, we might have a 1:1 relationship between PVs
    and storage daemon pods (though 1:n for attaching multiple guests
    to a single daemon).
  * This requires that we can create a new unix socket connection
    from new storage daemon pods to the VMs. The exact way to achieve
    this is still to be figured out. According to Adam Litke, the
    naive way would require elevated privileges for both pods.
  * After having the socket (either the file or a file descriptor)
    available in the VM pod, QEMU can connect to it.
* In order to avoid a mix of block devices having a PVC in the VM pod
  and others where we just passed the unix socket, we can completely
  avoid the PVC case for the VM pod:
  * For exposing a PV to QEMU, we would always go through the storage
    daemon (i.e. the PVC moves from the VM pod to the storage daemon
    pod), so the VM pod always only gets a unix socket connection,
    unifying the two cases.
  * Using vhost-user-blk from the storage daemon pod performs the
    same (or potentially better if this allows for polling that we
    wouldn’t have done otherwise) as having a PVC directly in the VM
    pod, so while it looks like an indirection, the actual I/O path
    would be comparable.
  * This architecture would also allow using the native
    Gluster/Ceph/NBD/… block drivers in the QEMU process without
    making them special (because they wouldn’t use a PVC either),
    unifying even more cases.
  * Kubernetes has fairly low per-node Pod limits by default so we
    may need to be careful about 1:1 Pod/PVC mapping.  We may want to
    support aggregation of multiple storage connections into a single
    q-s-d Pod.
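
To put the Pod-limit concern in numbers, here is a rough sketch. It assumes the kubelet's default limit of 110 Pods per node (clusters can configure a different value); the function name and the workload figures are made up for illustration.

```python
import math

# Assumption: kubelet default, can be changed by the cluster admin
DEFAULT_MAX_PODS_PER_NODE = 110


def pods_needed(vms: int, volumes_per_vm: int, volumes_per_qsd: int = 1) -> int:
    """Return the number of Pods used by `vms` VM Pods plus their q-s-d Pods.

    volumes_per_qsd=1 models the 1:1 Pod/PVC mapping; larger values model
    aggregating several storage connections into a single q-s-d Pod.
    """
    total_volumes = vms * volumes_per_vm
    qsd_pods = math.ceil(total_volumes / volumes_per_qsd)
    return vms + qsd_pods


one_to_one = pods_needed(vms=30, volumes_per_vm=3)
aggregated = pods_needed(vms=30, volumes_per_vm=3, volumes_per_qsd=10)
print(one_to_one, one_to_one <= DEFAULT_MAX_PODS_PER_NODE)
print(aggregated, aggregated <= DEFAULT_MAX_PODS_PER_NODE)
```

With these made-up figures the 1:1 mapping would not fit on a single node under the default limit, while aggregating ten connections per q-s-d Pod would.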

## Other topics

### Device Mapper

Another possibility is to leverage the Linux device-mapper to
provide features such as snapshots and even Incremental Backup.
For example, [dm-era][] seems to provide the basic primitives for
bitmap tracking.

This could be part of scenario number 1, or cascaded with other PVs
somewhere else.

Is this already being used? For example, [cybozu-go/topolvm][] is a
CSI LVM Plugin for k8s.

### Stratis

[Stratis][] seems to be an interesting project to be leveraged in the
world of Kubernetes.

### vhost-user-blk in other CSI backends

Would it make sense for other CSI backends to implement support for
vhost-user-blk?

[CSI]: https://kubernetes.io/blog/2019/01/15/container-storage-interface-ga/
[KubeVirt]: https://kubevirt.io/
[PVC resize blog]: https://kubernetes.io/blog/2018/07/12/resizing-persistent-volumes-using-kubernetes/
[Persistent Volumes]: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
[SPDK]: https://spdk.io/
[Storage-Current]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage-Current.png
[Storage-Passthrough]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage-Passthrough.png
[Storage-QSD]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage-QSD.png
[Storage-Virtiofs]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage-Virtiofs.png
[Stratis]: https://stratis-storage.github.io/
[cybozu-go/topolvm]: https://github.com/cybozu-go/topolvm
[dm-era]: https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/era.html
[kubevirt/kubevirt#3208]: https://github.com/kubevirt/kubevirt/pull/3208
[vhost_user_block]: https://github.com/cloud-hypervisor/cloud-hypervisor/tree/master/vhost_user_block
[virtio-fs]: https://virtio-fs.gitlab.io/

[RFC DOCUMENT 05/12] kubevirt-and-kvm: Add Networking page

Posted by Andrea Bolognani 5 years, 4 months ago

https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md

# Networking

## Problem description

Service meshes (such as [Istio][], [Linkerd][]) typically expect
application processes to run on the same physical host, usually in a
separate user namespace. Network namespaces might be used too, for
additional isolation. Network traffic to and from local processes is
monitored and proxied by redirection and observation of local
sockets. `iptables` and `nftables` (collectively referred to as the
`netfilter` framework) are the typical Linux facilities providing
classification and redirection of packets.

![containers][Networking-Containers]
*Service meshes with containers. Typical ingress path:
**1.** NIC driver queues buffers for IP processing
**2.** `netfilter` rules installed by *service mesh* redirect packets
       to proxy
**3.** IP receive path completes, L4 protocol handler invoked
**4.** TCP socket of proxy receives packets
**5.** proxy opens TCP socket towards application service
**6.** packets get TCP header, ready for classification
**7.** `netfilter` rules installed by service mesh forward request to
       service
**8.** local IP routing queues packets for TCP protocol handler
**9.** application process receives packets and handles request.
Egress path is conceptually symmetrical.*

If we move application processes to VMs, sockets and processes are
not visible anymore. All the traffic is typically forwarded via
interfaces operating at data link level. Socket redirection and port
mapping to local processes don't work.

![and now?][Networking-Challenge]
*Application process moved to VM:
**8.** IP layer enqueues packets to L2 interface towards application
**9.** `tap` driver forwards L2 packets to guest
**10.** packets are received on `virtio-net` ring buffer
**11.** guest driver queues buffers for IP processing
**12.** IP receive path completes, L4 protocol handler invoked
**13.** TCP socket of application receives packets and handles request.
**:warning: Proxy challenge**: the service mesh can't forward packets
to local sockets via `netfilter` rules. *Add-on* NAT rules might
conflict, as service meshes expect full control of the ruleset.
Socket monitoring and PID/UID classification isn't possible.*

## Existing solutions

Existing solutions typically implement a full TCP/IP stack, replaying
traffic on sockets local to the Pod of the service mesh. This creates
the illusion of application processes running on the same host,
possibly separated by user namespaces.

![slirp][Networking-Slirp]
*Existing solutions introduce a third TCP/IP stack:
**8.** local IP routing queues packets for TCP protocol handler
**9.** userspace implementation of TCP/IP stack receives packets on
       local socket, and
**10.** forwards L2 encapsulation to `tap` *QEMU* interface (socket
        back-end).*

While being almost transparent to the service mesh infrastructure,
this kind of solution comes with a number of downsides:

* three different TCP/IP stacks (guest, adaptation and host) need to
  be traversed for every service request. There is no chance to
  implement zero-copy mechanisms, and the number of context switches
  increases dramatically
* addressing needs to be coordinated to create the pretense of
  consistent addresses and routes between guest and host
  environments. This typically needs a NAT with masquerading, or some
  form of packet bridging
* the traffic seen by the service mesh and observable externally is a
  distant replica of the packets forwarded to and from the guest
  environment:
  * TCP congestion windows and network buffering mechanisms in
    general operate differently from what would be naturally expected
    by the application
  * protocols carrying addressing information might pose additional
    challenges, as the applications don't see the same set of
    addresses and routes as they would if deployed with regular
    containers

## Experiments

![experiments: thin layer][Networking-Experiments-Thin-Layer]
*How can we improve on the existing solutions while maintaining
drop-in compatibility? A thin layer implements a TCP adaptation
and IP services.*

These are some directions we have been exploring so far:

* a thinner layer between guest and host, that only implements what's
  strictly needed to pretend processes are running locally. A further
  TCP/IP stack is not necessarily needed. Some sort of TCP adaptation
  is needed, however, if this layer (currently implemented as a
  userspace process) runs without the `CAP_NET_RAW` capability: we
  can't create raw IP sockets on the Pod, and therefore need to map
  packets at layer 2 to layer 4 sockets offered by the host kernel
  * to avoid implementing an actual TCP/IP stack like the one
    offered by *libslirp*, we can align TCP parameters advertised
    towards the guest (MSS, congestion window) to the socket
    parameters provided by the host kernel, probing them via the
    `TCP_INFO` socket option (introduced in Linux 2.4).
    Segmentation and reassembly are therefore not needed, providing
    solid chances to avoid dynamic memory allocation altogether, and
    congestion control becomes implicitly equivalent as parameters
    are mirrored between the two sides
  * to reflect the actual receiving dynamic of the guest and support
    retransmissions without a permanent userspace buffer, segments
    are not dequeued (`MSG_PEEK`) until acknowledged by the receiver
    (application)
  * similarly, the implementation of the host-side sender adjusts MSS
    (`TCP_MAXSEG` socket option, since Linux 2.6.28) and advertised
    window (`TCP_WINDOW_CLAMP`, since Linux 2.4) to the parameters
    observed from incoming packets
  * this adaptation layer needs to maintain some of the TCP states,
    but we can rely on the host kernel TCP implementation for the
    different states of connections being closed
  * no particular requirements are placed on the MTU of guest
    interfaces: if fragments are received, payload from the single
    fragmented packets can be reassembled by the host kernel as
    needed, and out-of-order fragments can be safely discarded, as
    there's no intermediate hop justifying the condition
* this layer would connect to `qemu` over a *UNIX domain socket*,
  instead of a `tap` interface, so that the `CAP_NET_ADMIN`
  capability doesn't need to be granted to any process on the Pod:
  no further network interfaces are created on the host
* transparent, adaptive mapping of ports to the guest, to avoid the
  need for explicit port forwarding
* security and maintainability goals: no dynamic memory allocation,
  ~2 000 *LoC* target, no external dependencies
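
The `MSG_PEEK` idea from the list above can be sketched as follows. This is only an illustration, not the experimental implementation: a UNIX socket pair stands in for the host-side TCP socket, and the `PeekBuffer` class and its method names are hypothetical.

```python
import socket


class PeekBuffer:
    """Leave unacknowledged bytes in the kernel socket buffer, not in userspace."""

    def __init__(self, sock: socket.socket) -> None:
        self.sock = sock

    def pending(self, max_bytes: int = 65536) -> bytes:
        # MSG_PEEK returns queued data without removing it from the socket
        return self.sock.recv(max_bytes, socket.MSG_PEEK)

    def acknowledge(self, nbytes: int) -> None:
        # Only acknowledged bytes are dequeued for good
        while nbytes > 0:
            chunk = self.sock.recv(nbytes)
            if not chunk:  # peer closed the connection
                break
            nbytes -= len(chunk)


host_side, peer = socket.socketpair()  # stand-in for a TCP connection
peer.sendall(b"hello world")

buf = PeekBuffer(host_side)
first = buf.pending()   # could be forwarded to the guest, and re-sent if lost
buf.acknowledge(5)      # the guest acknowledged the first 5 bytes
rest = buf.pending()    # the unacknowledged tail is still queued in the kernel
print(first, rest)

host_side.close()
peer.close()
```

Because the data stays queued in the kernel until it is acknowledged, a retransmission towards the guest only needs another peek, and no userspace copy has to be kept around.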

![experiments: ebpf][Networking-Experiments-eBPF]
*Additionally, an `eBPF` fast path could be implemented
**6.** hooking at socket level, and
**7.** mapping IP and Ethernet addresses,
with the existing layer implementing connection tracking and slow
path*

If additional capabilities are granted, the data path can be
optimised in several ways:

* with `CAP_NET_RAW`:
  * the adaptation layer can use raw IP sockets instead of L4 sockets,
    implementing pure connection tracking, without the need for any
    TCP logic: the guest operating system implements the single TCP
    stack needed with this variation
  * zero-copy mechanisms could be implemented using `vhost-user` and
    QEMU socket back-ends, instead of relying on a full-fledged layer 2
    (Ethernet) interface
* with `CAP_BPF` and `CAP_NET_ADMIN`:
  * context switching in packet forwarding could be avoided by the
    `sockmap` extension provided by `eBPF`, and programming the `XDP`
    data hooks for in-kernel data transfers
  * using eBPF programs, we might want to switch (dynamically?) to
    the `vhost-net` facility
  * the userspace process would still need to take care of
    establishing in-kernel flows, and providing IP and IPv6
    services (ARP, DHCP, NDP) for addressing transparency and to
    avoid the need for further capabilities (e.g.
    `CAP_NET_BIND_SERVICE`), but the main, fast datapath would
    reside entirely in the kernel
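
As an illustration of how such a layer might choose a data path based on the capabilities it was granted, here is a minimal sketch. The capability bit numbers come from `linux/capability.h`; the input is assumed to be the hexadecimal `CapEff` mask found in `/proc/self/status`, and the function names and data path labels are hypothetical.

```python
# Capability bit numbers as defined in linux/capability.h
CAP_NET_ADMIN = 12
CAP_NET_RAW = 13
CAP_BPF = 39


def has_cap(cap_eff_hex: str, cap: int) -> bool:
    """Check one capability bit in a CapEff-style hexadecimal mask."""
    return bool((int(cap_eff_hex, 16) >> cap) & 1)


def pick_datapath(cap_eff_hex: str) -> str:
    """Pick the most efficient data path allowed by the granted capabilities."""
    if has_cap(cap_eff_hex, CAP_BPF) and has_cap(cap_eff_hex, CAP_NET_ADMIN):
        return "ebpf-fast-path"   # in-kernel forwarding, userspace slow path
    if has_cap(cap_eff_hex, CAP_NET_RAW):
        return "raw-ip-sockets"   # connection tracking only, no TCP logic
    return "l4-sockets"           # unprivileged default with TCP adaptation


print(pick_datapath("0000000000000000"))
print(pick_datapath("0000000000002000"))
print(pick_datapath("0000008000001000"))
```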

[Istio]: https://istio.io/
[Linkerd]: https://linkerd.io/
[Networking-Challenge]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Challenge.png
[Networking-Containers]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Containers.png
[Networking-Experiments-Thin-Layer]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Experiments-Thin-Layer.png
[Networking-Experiments-eBPF]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Experiments-eBPF.png
[Networking-Slirp]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Networking-Slirp.png

[RFC DOCUMENT 06/12] kubevirt-and-kvm: Add Live Migration page

Posted by Andrea Bolognani 5 years, 4 months ago

https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Live-Migration.md

# Live Migration

There are two scenarios where live migration is triggered in KubeVirt:

* As per user request, by posting a `VirtualMachineInstanceMigration`
  to the cluster
* As per cluster request, for instance on a Node eviction (due to lack
  of resources or maintenance of the given Node)

In both situations, KubeVirt will use libvirt to handle logic and
coordination with QEMU while KubeVirt's components manage the
Kubernetes control plane and the cluster's limitations.
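
For the user-triggered case, the object posted to the cluster could be built along these lines. This is a sketch: the `migrate-` name prefix and the `testvmi` VMI name are made up, and the `apiVersion` is assumed to match the one shown for the KubeVirt resource elsewhere in these documents.

```python
import json


def migration_manifest(vmi_name: str, namespace: str = "default") -> dict:
    """Build a VirtualMachineInstanceMigration object for the given VMI."""
    return {
        "apiVersion": "kubevirt.io/v1alpha3",  # assumption, may differ by release
        "kind": "VirtualMachineInstanceMigration",
        "metadata": {"name": f"migrate-{vmi_name}", "namespace": namespace},
        "spec": {"vmiName": vmi_name},
    }


manifest = migration_manifest("testvmi")
print(json.dumps(manifest, indent=2))
```

The resulting JSON can be piped to `kubectl create -f -`, since `kubectl` accepts JSON as well as YAML.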

In short, KubeVirt:

* Checks if the target host is capable of migration of the given VM;
* Handles single network namespace per Pod by proxying migration data
  (more at [Networking][]);
* Handles cluster resources usage (e.g. bandwidth usage);
* Handles cross-version migration.

![Live migration between two nodes][Live-Migration-Flow]

## Limitations

Live migration is not possible if:

* The VM is configured with cpu-passthrough;
* The VM has a local or non-shared volume;
* The Pod is using bridge binding for network access (right side of
  image below).

![KubeVirt's Pod][Live-Migration-Network]
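
The three limitations listed above amount to a simple pre-flight check, sketched below. The dictionary keys (`cpu_passthrough`, `volumes`, `shared`, `network_binding`) are hypothetical and do not correspond to actual KubeVirt API fields.

```python
def migration_blockers(vm: dict) -> list:
    """Return the reasons why a VM, described by a made-up dict, can't migrate."""
    blockers = []
    if vm.get("cpu_passthrough"):
        blockers.append("cpu-passthrough")
    if any(not vol.get("shared", False) for vol in vm.get("volumes", [])):
        blockers.append("local or non-shared volume")
    if vm.get("network_binding") == "bridge":
        blockers.append("bridge binding")
    return blockers


migratable = {"volumes": [{"shared": True}], "network_binding": "masquerade"}
stuck = {"cpu_passthrough": True, "volumes": [{"shared": False}],
         "network_binding": "bridge"}
print(migration_blockers(migratable))
print(migration_blockers(stuck))
```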

## More on KubeVirt's Live migration

This blog [post on live migration][] explains how to have live
migration enabled in KubeVirt's VMs and describes some of its
caveats.

[Live-Migration-Flow]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Live-Migration-Flow.png
[Live-Migration-Network]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Live-Migration-Network.png
[Networking]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
[post on live migration]: https://kubevirt.io/2020/Live-migration.html

[RFC DOCUMENT 07/12] kubevirt-and-kvm: Add CPU Pinning page

Posted by Andrea Bolognani 5 years, 4 months ago

https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/CPU-Pinning.md

# CPU pinning

As is the case for many of KubeVirt's features, CPU pinning is
partially achieved using standard Kubernetes components: this both
reduces the amount of new code that has to be written and guarantees
better integration with containers running side by side with the VMs.

## Kubernetes CPU Manager

The Static policy allocates exclusive CPUs to pod containers in the
Guaranteed QoS class which request integer CPUs. On a best-effort
basis, the Static policy tries to allocate CPUs topologically in the
following order:

* Allocate all the CPUs in the same processor socket if available and
  the container requests at least an entire socket worth of CPUs.
* Allocate all the logical CPUs (hyperthreads) from the same physical
  CPU core if available and the container requests an entire core
  worth of CPUs.
* Allocate any available logical CPU, preferring to acquire CPUs from
  the same socket.
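
The eligibility rule described above (Guaranteed QoS class plus an integer CPU request) can be sketched as follows. This is a simplification: quantities are compared as plain strings, so `"1"` and `"1000m"` would be treated as different, and the function names are made up.

```python
def cpu_millis(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "2" or "500m" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def is_guaranteed(containers: list) -> bool:
    """Guaranteed QoS: every container sets CPU and memory limits equal to requests."""
    for c in containers:
        requests, limits = c.get("requests", {}), c.get("limits", {})
        for resource in ("cpu", "memory"):
            if resource not in limits or requests.get(resource) != limits[resource]:
                return False
    return True


def gets_exclusive_cpus(containers: list) -> list:
    """Return, per container, whether the Static policy would assign exclusive CPUs."""
    if not is_guaranteed(containers):
        return [False] * len(containers)
    result = []
    for c in containers:
        millis = cpu_millis(c["requests"]["cpu"])
        result.append(millis >= 1000 and millis % 1000 == 0)
    return result


whole = {"requests": {"cpu": "2", "memory": "1Gi"},
         "limits": {"cpu": "2", "memory": "1Gi"}}
fractional = {"requests": {"cpu": "1500m", "memory": "1Gi"},
              "limits": {"cpu": "1500m", "memory": "1Gi"}}
print(gets_exclusive_cpus([whole]), gets_exclusive_cpus([fractional]))
```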

## KubeVirt dedicated CPU placement

KubeVirt relies on the Kubernetes CPU Manager to allocate dedicated
CPUs to the `virt-launcher` container.

When `virt-launcher` starts, it reads
`/sys/fs/cgroup/cpuset/cpuset.cpus` and generates `<vcpupin>`
configuration for libvirt based on the information found within.
However, affinity changes require `CAP_SYS_NICE` so this additional
capability has to be granted to the VM pod.
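
The translation from the cpuset file to libvirt configuration can be sketched as follows. The straightforward mapping of the n-th vCPU to the n-th dedicated host CPU is an assumption made for illustration, and the function names are hypothetical; KubeVirt's actual logic may differ.

```python
def parse_cpuset(text: str) -> list:
    """Expand a cpuset list such as "0-3,8" into individual CPU numbers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def vcpupin_xml(cpuset_text: str, vcpus: int) -> str:
    """Generate a <cputune> element pinning each vCPU to one dedicated host CPU."""
    cpus = parse_cpuset(cpuset_text)
    if vcpus > len(cpus):
        raise ValueError("not enough dedicated CPUs for the requested vCPUs")
    lines = [f"  <vcpupin vcpu='{v}' cpuset='{cpus[v]}'/>" for v in range(vcpus)]
    return "\n".join(["<cputune>", *lines, "</cputune>"])


# Example content of cpuset.cpus inside the container
xml = vcpupin_xml("2-3,8\n", vcpus=2)
print(xml)
```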

Going forward, we would like to perform the affinity change in
`virt-handler` (the privileged component running at the node level),
which would allow the VM pod to work without additional capabilities.

[RFC DOCUMENT 08/12] kubevirt-and-kvm: Add NUMA Pinning page

Posted by Andrea Bolognani 5 years, 4 months ago

https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/NUMA-Pinning.md

# NUMA pinning

KubeVirt doesn't currently implement NUMA pinning due to Kubernetes
limitations.

## Kubernetes Topology Manager

Allows aligning CPU and peripheral device allocations by NUMA node.

Many limitations:

* Not scheduler aware.
* Doesn’t allow memory alignment.
* etc...

[RFC DOCUMENT 09/12] kubevirt-and-kvm: Add Isolation page

Posted by Andrea Bolognani 5 years, 4 months ago

https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Isolation.md

# Isolation

How is the QEMU process isolated from the host and from other VMs?

## Traditional virtualization

cgroups

* managed by libvirt

SELinux

* libvirt is privileged and QEMU is protected by SELinux policies set
  by libvirt (SVirt)
* QEMU runs with SELinux type `svirt_t`

## KubeVirt

cgroups

* Managed by kubelet
  * No involvement from libvirt
* Memory limits
  * When using hard limits, the entire VM can be killed by Kubernetes
  * Memory consumption estimates are based on heuristics

SELinux

* KubeVirt is not using SVirt and there are no plans to do so
* At the moment, the custom [KubeVirt SELinux policy][] is used to
  ensure libvirt has sufficient privilege to perform its own setup
  steps
  * The standard SELinux type used by containers is `container_t`
  * KubeVirt would like to eventually use the same for VMs as well

Capabilities

* The default set of capabilities is fairly conservative
  * Privileged operation should happen outside of the pod: in
    KubeVirt's case, a good candidate is `virt-handler`, the
    privileged component that runs at the node level
* Additional capabilities can be requested for a pod
  * However, this is frowned upon and considered a liability from the
    security point of view
  * The cluster admin may even set a security policy that prevents
    pods from using certain capabilities
    * In such a scenario, KubeVirt workloads may be entirely unable
      to run

## Specific examples

The following is a list of examples, either historical or current, of
scenarios where libvirt's approach to isolation clashed with
Kubernetes' and changes on either component were necessary.

SELinux

* libvirt use of hugetlbfs for hugepages config is disallowed by
  `container_t`
  * Possibly fixable by using memfd
    * [libvirt memoryBacking docs][]
    * [KubeVirt memfd issue][]
* Use of libvirt+QEMU multiqueue tap support is disallowed by
  `container_t`
  * And there’s no way to pass in this setup from outside the
    existing stack
  * [KubeVirt multiqueue workaround][] extending their SELinux policy to allow
    `attach_queue`
* Passing precreated tap devices to libvirt triggers
  relabelfrom+relabelto `tun_socket` SELinux access
  * This may not be the virt stack's fault: it seems to happen
    automatically when permissions aren’t correct

Capabilities

* libvirt performs memory locking for VFIO devices unconditionally
  * Previously KubeVirt had to grant `CAP_SYS_RESOURCE` to pods.
    KubeVirt worked around it by duplicating libvirt’s memory pinning
    calculations so the libvirt action would be a no-op, but that is
    fragile and may cause the issue to resurface if libvirt's
    calculation logic changes.
  * References: [libvir-list memlock thread][], [KubeVirt memlock
    PR][], [libvirt qemuDomainGetMemLockLimitBytes][], [KubeVirt
    VMI.getMemlockSize][]
* virtiofsd requires `CAP_SYS_ADMIN` capability to perform
  `unshare(CLONE_NEWPID|CLONE_NEWNS)`
  * This is required for certain use cases like running overlayfs in
    the VM on top of virtiofs, but is not a requirement for all
    use cases.
  * References: [KubeVirt virtiofs PR][], [RHEL virtiofs bug][]
* KubeVirt uses libvirt for CPU pinning, which requires the pod to
  have `CAP_SYS_NICE`.
  * Long term, KubeVirt would like to handle that pinning in their
    privileged component virt-handler, so `CAP_SYS_NICE` can be
    dropped.
    * Sidenote: libvirt unconditionally requires `CAP_SYS_NICE` when
      any other running VM is using CPU pinning, however this sounds
      like a plain old bug.
  * References: [KubeVirt CPU pinning PR][], [KubeVirt CPU pinning
    workaround PR][], [RHEL CPU pinning bug][]
* libvirt bridge usage used to require `CAP_NET_ADMIN`
  * This is a historical example for reference. libvirt usage of a
    bridge device always implied tap device creation, which required
    `CAP_NET_ADMIN` privileges for the pod
  * The fix was to teach libvirt to accept a precreated tap device
    and skip some setup operations on it
    * Example XML: `<interface type='ethernet'><target dev='mytap0'
      managed='no'/></interface>`
  * KubeVirt still hasn’t fully managed to drop `CAP_NET_ADMIN`
    though
  * References: [RHEL precreated TAP bug][], [libvirt precreated TAP
    patches][], [KubeVirt precreated TAP PR][], [KubeVirt NET_ADMIN
    PR][], [KubeVirt NET_ADMIN issue][]

[KubeVirt CPU pinning PR]: https://github.com/kubevirt/kubevirt/pull/1381
[KubeVirt CPU pinning workaround PR]: https://github.com/kubevirt/kubevirt/pull/1648
[KubeVirt NET_ADMIN PR]: https://github.com/kubevirt/kubevirt/pull/3290
[KubeVirt NET_ADMIN issue]: https://github.com/kubevirt/kubevirt/issues/3085
[KubeVirt SELinux policy]: https://github.com/kubevirt/kubevirt/blob/master/cmd/virt-handler/virt_launcher.cil
[KubeVirt VMI.getMemlockSize]: https://github.com/kubevirt/kubevirt/blob/f5ffba5f84365155c81d0e2cda4aa709da062230/pkg/virt-handler/isolation/isolation.go#L206
[KubeVirt memfd issue]: https://github.com/kubevirt/kubevirt/issues/3781
[KubeVirt memlock PR]: https://github.com/kubevirt/kubevirt/pull/2584
[KubeVirt multiqueue workaround]: https://github.com/kubevirt/kubevirt/pull/2941/commits/bc55cb916003c54f6cbf329112a4e36d0d874836
[KubeVirt precreated TAP PR]: https://github.com/kubevirt/kubevirt/pull/2837
[KubeVirt virtiofs PR]: https://github.com/kubevirt/kubevirt/pull/3493
[RHEL CPU pinning bug]: https://bugzilla.redhat.com/show_bug.cgi?id=1819801
[RHEL precreated TAP bug]: https://bugzilla.redhat.com/show_bug.cgi?id=1723367
[RHEL virtiofs bug]: https://bugzilla.redhat.com/show_bug.cgi?id=1854595
[libvir-list memlock thread]: https://www.redhat.com/archives/libvirt-users/2019-August/msg00046.html
[libvirt memoryBacking docs]: https://libvirt.org/formatdomain.html#elementsMemoryBacking
[libvirt precreated TAP patches]: https://www.redhat.com/archives/libvir-list/2019-August/msg01256.html
[libvirt qemuDomainGetMemLockLimitBytes]: https://gitlab.com/libvirt/libvirt/-/blob/84bb5fd1ab2bce88e508d416f4bcea520c803ea8/src/qemu/qemu_domain.c#L8712

[RFC DOCUMENT 10/12] kubevirt-and-kvm: Add Upgrades page

Posted by Andrea Bolognani 5 years, 4 months ago

https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Upgrades.md

# Upgrades

The KubeVirt installation and upgrade processes are entirely controlled
by an [operator][], which is a common pattern in the Kubernetes
world. The operator is a piece of software running in the cluster and
managing the lifecycle of other components, in this case KubeVirt.

## The operator

What it does:

* Manages the whole KubeVirt installation
* Keeps the cluster actively in sync with the desired state
* Upgrades KubeVirt with zero downtime

## Installation

Install the operator:

```bash
$ LATEST=$(curl -L https://storage.googleapis.com/kubevirt-prow/devel/nightly/release/kubevirt/kubevirt/latest)
$ kubectl apply -f https://storage.googleapis.com/kubevirt-prow/devel/nightly/release/kubevirt/kubevirt/${LATEST}/kubevirt-operator.yaml
$ kubectl get pods -n kubevirt
NAME                                      READY   STATUS    RESTARTS   AGE
virt-operator-58cf9d6648-c7qph            1/1     Running   0          69s
virt-operator-58cf9d6648-pvzw2            1/1     Running   0          69s
```

Trigger the installation of KubeVirt:

```bash
$ LATEST=$(curl -L https://storage.googleapis.com/kubevirt-prow/devel/nightly/release/kubevirt/kubevirt/latest)
$ kubectl apply -f https://storage.googleapis.com/kubevirt-prow/devel/nightly/release/kubevirt/kubevirt/${LATEST}/kubevirt-cr.yaml
$ kubectl get pods -n kubevirt
NAME                                      READY   STATUS    RESTARTS   AGE
virt-api-8bdd88557-fllhr                  1/1     Running   0          59s
virt-controller-55ccb8cdcb-5rtp6          1/1     Running   0          43s
virt-controller-55ccb8cdcb-v8kr9          1/1     Running   0          43s
virt-handler-67pns                        1/1     Running   0          43s
```

The process happens in two steps because the operator relies on the
KubeVirt [custom resource][] for information on the desired
installation, and will not do anything until that resource exists in
the cluster.

## Upgrade

The upgrading process is similar:

* Install the latest operator
* Reference the new version in the KubeVirt CustomResource to trigger
  the actual upgrade

```bash
$ kubectl get kubevirt -n kubevirt kubevirt -o yaml
apiVersion: kubevirt.io/v1alpha3
kind: KubeVirt
metadata:
  name: kubevirt
spec:
  imageTag: v0.30
  certificateRotateStrategy: {}
  configuration: {}
  imagePullPolicy: IfNotPresent
```

Note the `imageTag` attribute: when present, the KubeVirt operator
will take steps to ensure that the version of KubeVirt that's
deployed on the cluster matches its value, which in our case will
trigger an upgrade.
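
Conceptually, the decision taken by the operator boils down to comparing the desired and deployed versions. The following is a minimal sketch of that idea, not the operator's actual code: the function name and the returned strings are hypothetical, and only the case where `imageTag` is set is modelled.

```python
from typing import Optional


def plan_action(cr_spec: dict, deployed_version: Optional[str]) -> str:
    """Decide what to do based on the KubeVirt CR spec and the deployed version."""
    desired = cr_spec.get("imageTag")
    if desired is None:
        return "no-op"  # this sketch does not model the case without imageTag
    if deployed_version is None:
        return f"install {desired}"
    if deployed_version != desired:
        return f"upgrade {deployed_version} -> {desired}"
    return "in-sync"


print(plan_action({"imageTag": "v0.30"}, "v0.29"))
print(plan_action({"imageTag": "v0.30"}, "v0.30"))
```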

The following chart explains the upgrade flow in more detail and shows
how the various components are affected:

![KubeVirt upgrade flow][Upgrades-Kubevirt]

KubeVirt is released as a complete suite: no individual
`virt-launcher` releases are planned. Everything is tested together,
everything is released together.

## QEMU and libvirt

The versions of QEMU and libvirt used for VMs are also tied to the
version of KubeVirt and are upgraded along with everything else.

* Migrations from old libvirt/QEMU to new libvirt/QEMU pairs are
  possible
* As soon as the new `virt-handler` and the new controller are rolled
  out, the cluster will only start VMIs with the new QEMU/libvirt
  versions

## Version compatibility

The virt stack is updated along with KubeVirt, which mitigates
compatibility concerns. As a rule of thumb, versions of QEMU and
libvirt older than a year or so are not taken into consideration.

Currently, the ability to perform backwards migration (e.g. from a
newer version of QEMU to an older one) is not necessary, but that
could very well change as KubeVirt becomes more widely used.

[Upgrades-Kubevirt]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Upgrades-Kubevirt.png
[custom resource]: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
[operator]: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/

[RFC DOCUMENT 11/12] kubevirt-and-kvm: Add Backpropagation page

Posted by Andrea Bolognani 5 years, 4 months ago

https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Backpropagation.md

# Backpropagation

Whenever a partial VM configuration is submitted to libvirt, any missing
information is automatically filled in to obtain a configuration that's
complete enough to guarantee long-term guest ABI stability.

PCI addresses are perhaps the most prominent example of this: most management
applications don't include this information at all in the XML they submit to
libvirt, and rely on libvirt building a reasonable PCI topology to support the
requested devices.

For example, using a made-up YAML syntax for brevity, the input could look like

```yaml
devices:
  disks:
  - image: /path/to/image.qcow2
```

and the output could be augmented by libvirt to look like

```yaml
devices:
  controllers:
  - model: pcie-root-port
    address:
      type: pci
      domain: 0x0000
      bus: 0x00
      slot: 0x01
      function: 0x0
  disks:
  - image: /path/to/image.qcow2
    model: virtio-blk
    address:
      type: pci
      domain: 0x0000
      bus: 0x01
      slot: 0x00
      function: 0x0
```

This is where backpropagation comes in: the only version of the VM
configuration that is complete enough to guarantee a stable guest ABI is the
one that includes all information added by libvirt, so if the management
application wants to be able to make further changes to the VM it needs to
reflect the additional information back into its understanding of the VM
configuration somehow.

For applications like virsh and virt-manager, this is easy: they don't have
their own configuration format or even store the VM configuration, and
simply fetch it from libvirt and operate on it directly every single time.

oVirt, to the best of my knowledge, generates an initial VM configuration based
on the settings provided by the user, submits it to libvirt and then parses
back the augmented version, figuring out what information was added and
updating its database to match: if the VM configuration needs to be generated
again later, it will include all information present in the database, including
the parts that originated from libvirt rather than the user.
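
The "figuring out what information was added" step can be sketched as a recursive comparison of the submitted and augmented configurations. The example reuses the made-up configuration from earlier in this document, expressed as Python data; the function name is hypothetical, and list items are simply paired by position, whereas a real tool would need stable identifiers.

```python
def added_by_libvirt(submitted, augmented):
    """Return the parts of `augmented` that are missing from `submitted`."""
    if isinstance(submitted, dict) and isinstance(augmented, dict):
        diff = {}
        for key, value in augmented.items():
            if key not in submitted:
                diff[key] = value
            else:
                nested = added_by_libvirt(submitted[key], value)
                if nested:
                    diff[key] = nested
        return diff
    if isinstance(submitted, list) and isinstance(augmented, list):
        diffs = [added_by_libvirt(s, a) for s, a in zip(submitted, augmented)]
        diffs += augmented[len(submitted):]
        return diffs if any(diffs) else []
    return None  # scalars are treated as unchanged in this sketch


def pci(bus: int, slot: int) -> dict:
    return {"type": "pci", "domain": 0x0000, "bus": bus, "slot": slot, "function": 0x0}


submitted = {"devices": {"disks": [{"image": "/path/to/image.qcow2"}]}}
augmented = {"devices": {
    "controllers": [{"model": "pcie-root-port", "address": pci(0x00, 0x01)}],
    "disks": [{"image": "/path/to/image.qcow2", "model": "virtio-blk",
               "address": pci(0x01, 0x00)}],
}}

added = added_by_libvirt(submitted, augmented)
print(added)
```

The result contains the whole `controllers` list plus the `model` and `address` of the disk, which is the information a management application would need to store in order to regenerate the same configuration later.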

KubeVirt does not currently perform any backpropagation. There are two ways a
user can influence PCI address allocation:

* explicitly add a `pciAddress` attribute for the device, which will cause
  KubeVirt to pass the corresponding address to libvirt, which in turn will
  attempt to comply with the user's request;
* add the `kubevirt.io/placePCIDevicesOnRootComplex` annotation to the VM
  configuration, which will cause KubeVirt to provide libvirt with a
  fully-specified PCI topology where all devices live on the PCIe Root Bus.

In all cases but the one where KubeVirt defines the full PCI topology itself,
it's implicitly relying on libvirt always building the PCI topology in the
exact same way every single time in order to have a stable guest ABI. While
this works in practice, it's not something that libvirt actually guarantees:
once a VM has been defined, libvirt will never change its PCI topology, but
submitting the same partial VM configuration to different libvirt versions can
result in different PCI topologies.

[RFC DOCUMENT 12/12] kubevirt-and-kvm: Add Contacts page

Posted by Andrea Bolognani 5 years, 4 months ago

https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Contacts.md

# Contacts and credits

# Contacts

The following people have agreed to serve as points of contact for
follow-up discussion around the topics included in these documents.

## Overall

* Andrea Bolognani <<abologna@redhat.com>> (KVM user space)
* Cole Robinson <<crobinso@redhat.com>> (KVM user space)
* Roman Mohr <<rmohr@redhat.com>> (KubeVirt)
* Vladik Romanovsky <<vromanso@redhat.com>> (KubeVirt)

## Networking

* Alona Paz <<alkaplan@redhat.com>> (KubeVirt)
* Stefano Brivio <<sbrivio@redhat.com>> (KVM user space)

## Storage

* Adam Litke <<alitke@redhat.com>> (KubeVirt)
* Stefan Hajnoczi <<stefanha@redhat.com>> (KVM user space)

# Credits

In addition to those listed above, the following people have also
contributed to the documents or the discussion around them.

Ademar Reis, Adrian Moreno Zapata, Alice Frosi, Amnon Ilan, Ariel
Adam, Christophe de Dinechin, Dan Kenigsberg, David Gilbert, Eduardo
Habkost, Fabian Deutsch, Gerd Hoffmann, Jason Wang, John Snow, Kevin
Wolf, Marc-André Lureau, Michael Henriksen, Michael Tsirkin, Paolo
Bonzini, Peter Krempa, Petr Horacek, Richard Jones, Sergio Lopez,
Steve Gordon, Victor Toso, Vivek Goyal.

If your name should be in the list above but is not, please know that
was an honest mistake and not a way to downplay your contribution!
Get in touch and we'll get it sorted out :)