From: Julien Grall
To: xen-devel@lists.xenproject.org
Cc: dwmw2@infradead.org, paul@xen.org, hongyxia@amazon.com,
 raphning@amazon.com, maghul@amazon.com, Julien Grall, Andrew Cooper,
 George Dunlap, Ian Jackson, Jan Beulich, Julien Grall,
 Stefano Stabellini, Wei Liu
Subject: [PATCH RFC 1/2] docs/design: Add a design document for Live Update
Date: Thu, 6 May 2021 11:42:58 +0100
Message-Id: <20210506104259.16928-2-julien@xen.org>
In-Reply-To: <20210506104259.16928-1-julien@xen.org>
References: <20210506104259.16928-1-julien@xen.org>

From: Julien Grall

Administrators often require updating the Xen hypervisor to address security
vulnerabilities, introduce new features, or fix software defects. Currently,
we offer the following methods to perform the update:

    * Rebooting the guests and the host: this is highly disruptive to running
      guests.
    * Migrating off the guests, rebooting the host: this currently requires
      the guest to cooperate (see [1] for a non-cooperative solution) and it
      may not always be possible to migrate it off (e.g. lack of capacity, use
      of local storage...).
    * Live patching: This is the least disruptive of the existing methods.
      However, it can be difficult to prepare the livepatch if the change is
      large or there are data structures to update.

This patch will introduce a new proposal called "Live Update" which will
activate new software without noticeable downtime (i.e. no - or minimal -
customer pain).

Signed-off-by: Julien Grall
---
 docs/designs/liveupdate.md | 254 +++++++++++++++++++++++++++++++++++++
 1 file changed, 254 insertions(+)
 create mode 100644 docs/designs/liveupdate.md

diff --git a/docs/designs/liveupdate.md b/docs/designs/liveupdate.md
new file mode 100644
index 000000000000..32993934f4fe
--- /dev/null
+++ b/docs/designs/liveupdate.md
@@ -0,0 +1,254 @@
+# Live Updating Xen
+
+## Background
+
+Administrators often require updating the Xen hypervisor to address security
+vulnerabilities, introduce new features, or fix software defects. Currently,
+we offer the following methods to perform the update:
+
+    * Rebooting the guests and the host: this is highly disruptive to running
+      guests.
+    * Migrating off the guests, rebooting the host: this currently requires
+      the guest to cooperate (see [1] for a non-cooperative solution) and it
+      may not always be possible to migrate it off (e.g. lack of capacity, use
+      of local storage...).
+    * Live patching: This is the least disruptive of the existing methods.
+      However, it can be difficult to prepare the livepatch if the change is
+      large or there are data structures to update.
+
+This document will present a new approach called "Live Update" which will
+activate new software without noticeable downtime (i.e. no - or minimal -
+customer pain).
+
+## Terminology
+
+xen#1: Xen version currently active and running on a droplet. This is the
+“source” for the Live Update operation. This version can actually be newer
+than xen#2 in case of a rollback operation.
+
+xen#2: Xen version that's the “target” of the Live Update operation. This
+version will become the active version after successful Live Update. This
+version of Xen can actually be older than xen#1 in case of a rollback
+operation.
+
+## High-level overview
+
+Xen has a framework to bring a new image of the Xen hypervisor in memory using
+kexec. The existing framework does not meet the baseline functionality for
+Live Update, since kexec results in a restart for the hypervisor, host, Dom0,
+and all the guests.
+
+The operation can be divided in roughly 4 parts:
+
+    1. Trigger: The operation will be triggered from outside the hypervisor
+       (e.g. dom0 userspace).
+    2. Save: The state will be stabilized by pausing the domains and
+       serialized by xen#1.
+    3. Hand-over: xen#1 will pass the serialized state and transfer control to
+       xen#2.
+    4. Restore: The state will be deserialized by xen#2.
+
+All the domains will be paused before xen#1 starts to save the states,
+and any domain that was running before Live Update will be unpaused after
+xen#2 has finished restoring the states. This is to prevent a domain from
+trying to modify the state of another domain while it is being saved/restored.
+
+The current approach could be seen as non-cooperative migration with a twist:
+all the domains (including dom0) are not expected to be involved in the Live
+Update process.
+
+The major differences compared to live migration are:
+
+    * The state is not transferred to another host, but instead locally to
+      xen#2.
+    * The memory content or device state (for passthrough) does not need to
+      be part of the stream. Instead we need to preserve it.
+    * PV backends, device emulators, xenstored are not recreated but preserved
+      (as these are part of dom0).
+
+
+Domains in process of being destroyed (*XEN\_DOMCTL\_destroydomain*) will need
+to be preserved because another entity may have mappings (e.g. foreign, grant)
+on them.
+
+## Trigger
+
+Live update is built on top of the kexec interface to prepare the command line,
+load xen#2 and trigger the operation. A new kexec type has been introduced
+(*KEXEC\_TYPE\_LIVE\_UPDATE*) to notify Xen to Live Update.
+
+The Live Update will be triggered from outside the hypervisor (e.g. dom0
+userspace). Support for the operation has been added in kexec-tools 2.0.21.
+
+All the domains will be paused before xen#1 starts to save the states.
+In Xen, *domain\_pause()* will pause the vCPUs as soon as they can be
+rescheduled. In other words, a pause request will not wait for asynchronous
+requests (e.g. I/O) to finish. For Live Update, this is not an ideal time to
+pause because it will require more xen#1 internal state to be transferred.
+Therefore, all the domains will be paused at an architectural restartable
+boundary.
+
+Live update will not happen synchronously to the request but when all the
+domains are quiescent. As domains running device emulators (e.g. Dom0) will
+be part of the process to quiesce HVM domains, we will need to let them run
+until xen#1 is actually starting to save the state. HVM vCPUs will be paused
+as soon as any pending asynchronous request has finished.
+
+In the current implementation, all PV domains will continue to run while the
+rest will be paused as soon as possible. Note this approach is assuming that
+device emulators are only running in PV domains.
+
+It should be easy to extend to PVH domains not requiring device emulations.
+It will require more thought if we need to run device models in HVM domains as
+there might be inter-dependency.
+
+## Save
+
+xen#1 will be responsible for preserving and serializing the state of each
+existing domain and any system-wide state (e.g. M2P).
+
+Each domain will be serialized independently using a modified migration stream.
+If there is any dependency between domains (such as for IOREQ server) it will
+be recorded using a domid. All the complexity of resolving the dependencies is
+left to the restore path in xen#2 (more in the *Restore* section).
+
+At the moment, the domains are saved one by one in a single thread, but it
+would be possible to consider multi-threading if it takes too long, although
+this may require some adjustment in the stream format.
+
+As we want to be able to Live Update between major versions of Xen (e.g. Xen
+4.11 -> Xen 4.15), the states preserved should not be a dump of Xen internal
+structures but instead the minimal information that allows us to recreate the
+domains.
+
+For instance, we don't want to preserve the frametable (and therefore
+*struct page\_info*) as-is because the refcounting may be different
+between xen#1 and xen#2 (see XSA-299). Instead, we want to be able to recreate
+*struct page\_info* based on minimal information that is considered stable
+(such as the page type).
+
+Note that upgrading between versions of Xen will also require all the hypercalls
+to be stable. This will not be covered by this document.
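The Save section says each domain is serialized independently and that cross-domain dependencies are recorded only as a domid. A minimal Python sketch of that idea follows. It assumes a type/length framing with the body padded to 8 bytes, in the style of the existing migration stream; the record type numbers, field widths and function names are hypothetical and are not taken from the patch.

```python
import struct
from typing import Iterator, Optional, Tuple

# Hypothetical record type numbers, for illustration only.
REC_DOMAIN_INFO = 1
REC_IOREQ_SERVER = 2

# Assumed framing: 32-bit record type, 32-bit body length, little endian.
HEADER = struct.Struct("<II")


def pack_record(rec_type: int, body: bytes) -> bytes:
    """Frame one record and pad its body to an 8-byte boundary."""
    padding = -len(body) % 8
    return HEADER.pack(rec_type, len(body)) + body + b"\0" * padding


def unpack_records(stream: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (type, body) pairs from a concatenation of framed records."""
    offset = 0
    while offset < len(stream):
        rec_type, length = HEADER.unpack_from(stream, offset)
        offset += HEADER.size
        yield rec_type, stream[offset:offset + length]
        offset += length + (-length % 8)


def save_domain(domid: int, ioreq_server_domid: Optional[int]) -> bytes:
    """Serialize one domain; a dependency is stored as a bare domid."""
    out = pack_record(REC_DOMAIN_INFO, struct.pack("<H", domid))
    if ioreq_server_domid is not None:
        # No pointer or internal structure is dumped, only the domid.
        out += pack_record(REC_IOREQ_SERVER,
                           struct.pack("<H", ioreq_server_domid))
    return out
```

Because a dependency is stored as a plain identifier, the save path never needs to look at the other domain; resolving the reference is left to the restore path.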
+
+## Hand over
+
+### Memory usage restrictions
+
+xen#2 must take care not to use any memory pages which already belong to
+guests. To facilitate this, a number of contiguous regions of memory are
+reserved for the boot allocator, known as *live update bootmem*.
+
+xen#1 will always reserve a region just below Xen (the size is controlled by
+the Xen command line parameter liveupdate) to allow Xen to grow and provide
+information about LiveUpdate (see the section *Breadcrumb*). The region will be
+passed to xen#2 using the same command line option but with the base address
+specified.
+
+For simplicity, additional regions will be provided in the stream. They will
+consist of regions that could be re-used by xen#2 during boot (such as the
+xen#1's frametable memory).
+
+xen#2 must not use any pages outside those regions until it has consumed the
+Live Update data stream and determined which pages are already in use by
+running domains or need to be re-used as-is by Xen (e.g. M2P).
+
+At run time, Xen may use memory from the reserved region for any purpose that
+does not require preservation over a Live Update; in particular it __must__ not be
+mapped to a domain or used by any Xen state requiring to be preserved (e.g.
+M2P). In other words, the xenheap pages could be allocated from the reserved
+regions if we remove the concept of shared xenheap pages.
+
+The xen#2's binary may be bigger (or smaller) compared to xen#1's binary. So
+for the purpose of loading xen#2 binary, kexec should treat the reserved memory
+right below xen#1 and its region as a single contiguous space. xen#2 will be
+loaded right at the top of the contiguous space and the rest of the memory will
+be the new reserved memory (this may shrink or grow). For that reason, freed
+init memory from xen#1 image is also treated as reserved live update
+bootmem.
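The loading rule in the last paragraph of "Memory usage restrictions" is simple address arithmetic, so a worked example may help. The Python sketch below treats the reserved region and the xen#1 image as one span, places xen#2 at the top, and reports what is left as the new reserved memory. The function name and all addresses are hypothetical, and alignment constraints are deliberately ignored.

```python
from typing import Tuple


def place_xen2(reserved_start: int, xen1_end: int,
               xen2_size: int) -> Tuple[int, int]:
    """Place xen#2 at the top of the span [reserved_start, xen1_end).

    Returns (load_addr, new_reserved_size). Alignment is ignored here.
    """
    load_addr = xen1_end - xen2_size
    if xen2_size <= 0 or load_addr < reserved_start:
        raise ValueError("xen#2 does not fit in the contiguous space")
    return load_addr, load_addr - reserved_start


# Hypothetical layout: 248 MiB reserved just below an 8 MiB xen#1 image.
RESERVED_START = 0x7000_0000
XEN1_END = 0x8000_0000

# A 10 MiB xen#2 is 2 MiB bigger than xen#1, so the reserved memory
# shrinks from 248 MiB to 246 MiB.
load_addr, new_reserved = place_xen2(RESERVED_START, XEN1_END, 0x00A0_0000)
```

With a smaller xen#2 the same computation yields a larger reserved region, which is the "shrink or grow" behaviour the text describes.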
+
+### Live Update data stream
+
+During handover, xen#1 creates a Live Update data stream containing all the
+information required by xen#2 to restore all the domains.
+
+Data pages for this stream may be allocated anywhere in physical memory outside
+the *live update bootmem* regions.
+
+As calling __vmap()__/__vunmap()__ has a cost on the downtime, we want to reduce the
+number of calls to __vmap()__ when restoring the stream. Therefore the stream
+will be contiguously virtually mapped in xen#2. xen#1 will create an array of
+MFNs of the allocated data pages, suitable for passing to __vmap()__. The
+array will be physically contiguous but the MFNs don't need to be physically
+contiguous.
+
+### Breadcrumb
+
+Since the Live Update data stream is created during the final **kexec\_exec**
+hypercall, its address cannot be passed on the command line to the new Xen
+because the command line needs to have been set up by **kexec(8)** in userspace
+long beforehand.
+
+Thus, to allow the new Xen to find the data stream, xen#1 places a breadcrumb
+in the first words of the Live Update bootmem, containing the number of data
+pages, and the physical address of the contiguous MFN array.
+
+### IOMMU
+
+Where devices are passed through to domains, it may not be possible to quiesce
+those devices for the purpose of performing the update.
+
+If performing Live Update with assigned devices, xen#1 will leave the IOMMU
+mappings active during the handover (thus implying that IOMMU page tables may
+not be allocated in the *live update bootmem* region either).
+
+xen#2 must take control of the IOMMU without causing those mappings to become
+invalid even for a short period of time. In other words, xen#2 should not
+re-setup the IOMMUs. On hardware which does not support Posted Interrupts,
+interrupts may need to be generated on resume.
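The "Live Update data stream" and "Breadcrumb" subsections together describe a two-step lookup: breadcrumb, then MFN array, then scattered data pages presented as one contiguous stream. The Python sketch below models physical memory as a `bytearray` to show that lookup. The breadcrumb layout (two 64-bit little-endian words, page count first), the 4 KiB page size and the function names are assumptions made for illustration; the patch only states which two values the breadcrumb contains.

```python
import struct

PAGE_SIZE = 4096  # assumed page size

# Assumed breadcrumb layout: number of data pages, then the physical
# address of the contiguous MFN array.
BREADCRUMB = struct.Struct("<QQ")


def write_breadcrumb(mem: bytearray, bootmem_base: int,
                     nr_pages: int, mfn_array_addr: int) -> None:
    """xen#1 side: leave the breadcrumb at the start of the bootmem."""
    BREADCRUMB.pack_into(mem, bootmem_base, nr_pages, mfn_array_addr)


def read_stream(mem: bytearray, bootmem_base: int) -> bytes:
    """xen#2 side: follow the breadcrumb and return the whole stream."""
    nr_pages, array_addr = BREADCRUMB.unpack_from(mem, bootmem_base)
    # The MFN array itself is physically contiguous...
    mfns = struct.unpack_from(f"<{nr_pages}Q", mem, array_addr)
    # ...but the pages it names are not. Joining them here stands in for
    # a single vmap() call that maps them virtually contiguously.
    return b"".join(
        bytes(mem[mfn * PAGE_SIZE:(mfn + 1) * PAGE_SIZE]) for mfn in mfns
    )
```

Only the bootmem base has to be known in advance (it is passed on the command line), which is why the breadcrumb can carry the information that did not exist yet when the command line was prepared.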
+
+## Restore
+
+After xen#2 has initialized itself and mapped the stream, it will be
+responsible for restoring the state of the system and each domain.
+
+Unlike the save part, it is not possible to restore a domain in a single pass.
+There are dependencies between:
+
+    1. different states of a domain. For instance, the event channels ABI
+       used (2l vs fifo) needs to be restored before restoring the event
+       channels.
+    2. the same "state" within a domain. For instance, in case of PV domain,
+       the pages' ownership needs to be restored before restoring the type
+       of the page (e.g. is it an L4, L1... table?).
+
+    3. domains. For instance when restoring the grant mapping, it will be
+       necessary to have the page's owner in hand to do proper refcounting.
+       Therefore the pages' ownership has to be restored first.
+
+Dependencies will be resolved using either multiple passes (for dependency
+type 2 and 3) or using a specific ordering between records (for dependency
+type 1).
+
+Each domain will be restored in 3 passes:
+
+    * Pass 0: Create the domain and restore the P2M for HVM. This can be broken
+      down in 3 parts:
+      * Allocate a domain via _domain\_create()_ but skip parts that require
+        extra records (e.g. HAP, P2M).
+      * Restore any parts which need to be done before creating the vCPUs. This
+        includes restoring the P2M and whether HAP is used.
+      * Create the vCPUs. Note this doesn't restore the state of the vCPUs.
+    * Pass 1: It will restore the pages' ownership and the grant-table frames.
+    * Pass 2: This step will restore any domain state (e.g. vCPU state, event
+      channels) that wasn't restored by the earlier passes.
+
+A domain should not have a dependency on another domain within the same pass.
+Therefore it would be possible to take advantage of all the CPUs to restore
+domains in parallel and reduce the overall downtime.
+
+Once all the domains have been restored, they will be unpaused if they were
+running before Live Update.
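The Restore section relies on two properties: a pass only starts once the previous pass has completed for every domain, and domains are independent of each other within a pass. A minimal Python sketch of that scheduling follows. The `restore_pass` callback, the worker count and the use of a thread pool are hypothetical stand-ins; the patch does not say how xen#2 would distribute the work across CPUs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

NR_PASSES = 3  # passes 0, 1 and 2 as described above


def restore_all(domids: Iterable[int],
                restore_pass: Callable[[int, int], None],
                max_workers: int = 4) -> None:
    """Run every pass over all domains, with a barrier between passes."""
    domids = list(domids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for pass_nr in range(NR_PASSES):
            # Within one pass, domains do not depend on each other, so
            # they can be restored in parallel.
            results = pool.map(
                lambda domid, p=pass_nr: restore_pass(domid, p), domids
            )
            # Draining the iterator waits for every domain (and re-raises
            # any error) before the next pass is allowed to start.
            list(results)
```

The barrier is what makes cross-domain dependencies safe: by the time pass 2 restores a grant mapping, pass 1 has already restored page ownership for every domain, including the granting one.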
+
+* * *
+[1] https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md;h=4b876d809fb5b8aac02d29fd7760a5c0d5b86d87;hb=HEAD
+
-- 
2.17.1