From: TSnake41
Subject: [RFC XEN PATCH v2 2/5] docs/designs: Add a design document for IOMMU subsystem redesign
To: xen-devel@lists.xenproject.org
Cc: Teddy Astie, Andrew Cooper, George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini, Marek Marczykowski-Górecki
Message-Id: <0f9658f25c98f1acdab8788c705287d743103d91.1719414736.git.teddy.astie@vates.tech>
Date: Wed, 26 Jun 2024 15:22:42 +0000

From: Teddy Astie

The current IOMMU subsystem has some limitations that make PV-IOMMU practically
impossible. One of them is the assumption that each domain is bound to a single
"IOMMU domain", which also causes complications in the quarantine implementation.

Moreover, the current IOMMU subsystem is not entirely well-defined; for instance,
the behavior of map_page greatly differs between ARM SMMUv3 and x86 VT-d/AMD-Vi.
On ARM it can modify the domain page table, while on x86 it may be forbidden
(e.g. when using HAP with PVH), or may only modify the devices' PoV (e.g. when
using PV).

The goal of this redesign is to define more explicitly the behavior and interface
of the IOMMU subsystem, while allowing PV-IOMMU to be effectively implemented.

Signed-off-by: Teddy Astie
---
Changed in V2:
* nit s/dettach/detach/
---
 docs/designs/iommu-contexts.md | 398 +++++++++++++++++++++++++++++++++
 1 file changed, 398 insertions(+)
 create mode 100644 docs/designs/iommu-contexts.md

diff --git a/docs/designs/iommu-contexts.md b/docs/designs/iommu-contexts.md
new file mode 100644
index 0000000000..8211f91692
--- /dev/null
+++ b/docs/designs/iommu-contexts.md
@@ -0,0 +1,398 @@

# IOMMU context management in Xen

Status: Experimental
Revision: 0

# Background

The design for *IOMMU paravirtualization for Dom0* [1] explains that some guests
may want access to IOMMU features. In order to implement this in Xen, several
adjustments need to be made to the IOMMU subsystem.

The *hardware IOMMU domain* is currently implemented on a per-domain basis, such
that each domain actually has a specific *hardware IOMMU domain*. This design aims
to allow a single Xen domain to manage several "IOMMU contexts", and to allow some
domains (e.g. Dom0 [1]) to modify their IOMMU contexts.

In addition to this, the quarantine feature can be refactored to use IOMMU
contexts, reducing the complexity of platform-specific implementations and
ensuring more consistency across platforms.

# IOMMU context

We define an "IOMMU context" as being a *hardware IOMMU domain*, but name it a
context to avoid confusion with Xen domains.
It represents some hardware-specific data structure that contains mappings from a
device frame number to a machine frame number (e.g. using a pagetable) that can be
applied to a device using the IOMMU hardware.

This structure is bound to a Xen domain, but a Xen domain may have several IOMMU
contexts. These contexts may be modifiable using the interface as defined in [1],
aside from some specific cases (e.g. modifying the default context).

This is implemented in Xen as a new structure that will hold context-specific
data.

```c
struct iommu_context {
    u16 id; /* Context id (0 means default context) */
    struct list_head devices;

    struct arch_iommu_context arch;

    bool opaque; /* context can't be modified nor accessed (e.g HAP) */
};
```

A context is identified by a number that is domain-specific and may be used by
IOMMU users such as PV-IOMMU by the guest.

struct arch_iommu_context is split out from struct arch_iommu:

```c
struct arch_iommu_context
{
    spinlock_t pgtables_lock;
    struct page_list_head pgtables;

    union {
        /* Intel VT-d */
        struct {
            uint64_t pgd_maddr; /* io page directory machine address */
            domid_t *didmap; /* per-iommu DID */
            unsigned long *iommu_bitmap; /* bitmap of iommu(s) that the context uses */
        } vtd;
        /* AMD IOMMU */
        struct {
            struct page_info *root_table;
        } amd;
    };
};

struct arch_iommu
{
    spinlock_t mapping_lock; /* io page table lock */
    struct {
        struct page_list_head list;
        spinlock_t lock;
    } pgtables;

    struct list_head identity_maps;

    union {
        /* Intel VT-d */
        struct {
            /* no more context-specific values */
            unsigned int agaw; /* adjusted guest address width, 0 is level 2 30-bit */
        } vtd;
        /* AMD IOMMU */
        struct {
            unsigned int paging_mode;
            struct guest_iommu *g_iommu;
        } amd;
    };
};
```

IOMMU context information is now carried by iommu_context rather than being
integrated into struct arch_iommu.

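
As an illustration of this split, code that needs hardware-specific state is
expected to reach per-context fields through `struct iommu_context` and keep only
domain-wide parameters in `struct arch_iommu`. The helper below is purely
hypothetical (it is not part of the proposed interface) and only shows where each
field now lives:

```c
/*
 * Hypothetical sketch: combine domain-wide and per-context VT-d state
 * after the split.  Field names follow the structures above.
 */
static uint64_t vtd_context_pgd_sketch(struct domain *d,
                                       struct iommu_context *ctx)
{
    const struct domain_iommu *hd = dom_iommu(d);

    /* Domain-wide parameters (e.g. address width) stay in struct arch_iommu. */
    unsigned int agaw = hd->arch.vtd.agaw;

    /* The IO page table root is now looked up per context. */
    uint64_t pgd_maddr = ctx->arch.vtd.pgd_maddr;

    (void)agaw; /* a real page table walker would use agaw to size the table */

    return pgd_maddr;
}
```
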
# Xen domain IOMMU structure

`struct domain_iommu` is modified to allow multiple contexts to exist within a
single Xen domain:

```c
struct iommu_context_list {
    uint16_t count; /* Context count excluding default context */

    /* if count > 0 */

    uint64_t *bitmap; /* bitmap of context allocation */
    struct iommu_context *map; /* Map of contexts */
};

struct domain_iommu {
    /* ... */

    struct iommu_context default_ctx;
    struct iommu_context_list other_contexts;

    /* ... */
};
```

default_ctx is a special context with id=0 that holds the page table mapping the
entire domain, which basically preserves the previous behavior. All devices are
expected to be bound to this context during initialization.

Along with this default context that always exists, we use a pool of contexts
whose size is fixed at domain initialization, where contexts can be allocated (if
possible) and have an id matching their position in the map (considering that
id != 0). These contexts may be used by IOMMU context users such as PV-IOMMU or
the quarantine domain (DomIO).

# Platform-independent context management interface

A new platform-independent interface is introduced in the Xen hypervisor to allow
IOMMU context users to create and manage contexts within domains.

```c
/* Direct context access functions (not supposed to be used directly) */
#define iommu_default_context(d) (&dom_iommu(d)->default_ctx)
struct iommu_context *iommu_get_context(struct domain *d, u16 ctx_no);
int iommu_context_init(struct domain *d, struct iommu_context *ctx, u16 ctx_no, u32 flags);
int iommu_context_teardown(struct domain *d, struct iommu_context *ctx, u32 flags);

/* Check if a specific context exists in the domain, note that ctx_no=0 always
   exists */
bool iommu_check_context(struct domain *d, u16 ctx_no);

/* Flag for default context initialization */
#define IOMMU_CONTEXT_INIT_default (1 << 0)

/* Flag for quarantine contexts (scratch page, DMA Abort mode, ...) */
#define IOMMU_CONTEXT_INIT_quarantine (1 << 1)

/* Flag to specify that devices will need to be reattached to the default domain */
#define IOMMU_TEARDOWN_REATTACH_DEFAULT (1 << 0)

/* Allocate a new context, uses CONTEXT_INIT flags */
int iommu_context_alloc(struct domain *d, u16 *ctx_no, u32 flags);

/* Free a context, uses CONTEXT_TEARDOWN flags */
int iommu_context_free(struct domain *d, u16 ctx_no, u32 flags);

/* Move a device from one context to another, including between different domains. */
int iommu_reattach_context(struct domain *prev_dom, struct domain *next_dom,
                           device_t *dev, u16 ctx_no);

/* Add a device to a context for first initialization */
int iommu_attach_context(struct domain *d, device_t *dev, u16 ctx_no);

/* Remove a device from a context, effectively removing it from the IOMMU. */
int iommu_detach_context(struct domain *d, device_t *dev);
```

This interface relies on a new interface with drivers (described below) to
implement these features.

Some existing functions will take a new parameter specifying which context the
operation applies to:
- iommu_map (iommu_legacy_map untouched)
- iommu_unmap (iommu_legacy_unmap untouched)
- iommu_lookup_page
- iommu_iotlb_flush

These functions will modify the iommu_context structure to accommodate the
operations applied; they will be used to replace some operations previously made
in the IOMMU driver.

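
To make the intended call flow concrete, the sketch below shows how a context user
such as PV-IOMMU could drive this interface. It is only an illustration: the exact
prototype of the context-aware iommu_map() is not defined in this document, so the
trailing `ctx_no` argument and the simplified error handling are assumptions.

```c
/*
 * Illustrative flow for a context user (e.g. a PV-IOMMU hypercall handler).
 * The iommu_map() prototype with a trailing ctx_no argument is an assumption
 * of this sketch; IOTLB flushing is omitted for brevity.
 */
static int pv_iommu_example(struct domain *d, device_t *dev,
                            dfn_t dfn, mfn_t mfn)
{
    u16 ctx_no;
    unsigned int flush_flags = 0;
    int ret;

    /* Create a fresh, non-default context in the domain's context pool. */
    ret = iommu_context_alloc(d, &ctx_no, 0);
    if ( ret )
        return ret;

    /* Bind the device to the new context... */
    ret = iommu_attach_context(d, dev, ctx_no);
    if ( ret )
        goto free_ctx;

    /* ...and populate a mapping in that context only. */
    ret = iommu_map(d, dfn, mfn, 1, IOMMUF_readable | IOMMUF_writable,
                    &flush_flags, ctx_no);
    if ( ret )
        goto detach;

    return 0;

 detach:
    iommu_detach_context(d, dev);
 free_ctx:
    iommu_context_free(d, ctx_no, 0);
    return ret;
}
```
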
# IOMMU platform_ops interface changes

The IOMMU driver needs to expose a way to create and manage IOMMU contexts. The
approach taken here is to modify the interface to allow specifying an IOMMU
context on operations, and at the same time to simplify the interface by relying
more on platform-independent iommu code.

Added functions in iommu_ops:

```c
/* Initialize a context (creating page tables, allocating hardware, structures, ...) */
int (*context_init)(struct domain *d, struct iommu_context *ctx,
                    u32 flags);
/* Destroy a context, assumes no device is bound to the context. */
int (*context_teardown)(struct domain *d, struct iommu_context *ctx,
                        u32 flags);
/* Put a device in a context (assumes the device is not attached to another context) */
int (*attach)(struct domain *d, device_t *dev,
              struct iommu_context *ctx);
/* Remove a device from a context, and from the IOMMU. */
int (*detach)(struct domain *d, device_t *dev,
              struct iommu_context *prev_ctx);
/* Move the device from a context to another, including if the new context is in
   another domain. d corresponds to the target domain. */
int (*reattach)(struct domain *d, device_t *dev,
                struct iommu_context *prev_ctx,
                struct iommu_context *ctx);

#ifdef CONFIG_HAS_PCI
/* Specific interface for phantom function devices. */
int (*add_devfn)(struct domain *d, struct pci_dev *pdev, u16 devfn,
                 struct iommu_context *ctx);
int (*remove_devfn)(struct domain *d, struct pci_dev *pdev, u16 devfn,
                    struct iommu_context *ctx);
#endif

/* Changes in existing functions to use a specified iommu_context. */
int __must_check (*map_page)(struct domain *d, dfn_t dfn, mfn_t mfn,
                             unsigned int flags,
                             unsigned int *flush_flags,
                             struct iommu_context *ctx);
int __must_check (*unmap_page)(struct domain *d, dfn_t dfn,
                               unsigned int order,
                               unsigned int *flush_flags,
                               struct iommu_context *ctx);
int __must_check (*lookup_page)(struct domain *d, dfn_t dfn, mfn_t *mfn,
                                unsigned int *flags,
                                struct iommu_context *ctx);

int __must_check (*iotlb_flush)(struct iommu_context *ctx, dfn_t dfn,
                                unsigned long page_count,
                                unsigned int flush_flags);

void (*clear_root_pgtable)(struct domain *d, struct iommu_context *ctx);
```

These functions are redundant with some existing ones; therefore, the following
functions are replaced with their new equivalents:
- quarantine_init: platform-independent code and the IOMMU_CONTEXT_INIT_quarantine flag
- add_device: attach and add_devfn (phantom)
- assign_device: attach and add_devfn (phantom)
- remove_device: detach and remove_devfn (phantom)
- reassign_device: reattach

There are some functional differences with the previous functions; the following
should now be handled by platform-independent/arch-specific code instead of the
IOMMU driver:
- identity mappings (unity mappings and RMRR)
- device list in context and domain
- domain of a device
- quarantine

The idea behind this is to implement IOMMU context features while simplifying
IOMMU driver implementations and ensuring more consistency between IOMMU drivers.

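
As an illustration of that split of responsibilities, common code could keep the
device-list and domain bookkeeping and only delegate hardware programming to the
driver. The following simplified sketch is hypothetical: locking, phantom
functions and most error handling are omitted, and the `context_list` field is an
assumed name, not part of the proposal.

```c
/*
 * Hypothetical, simplified shape of the common attach path.  The driver's
 * attach hook only programs the hardware; list management stays in
 * platform-independent code.
 */
int iommu_attach_context(struct domain *d, device_t *dev, u16 ctx_no)
{
    struct iommu_context *ctx;
    int ret;

    if ( !iommu_check_context(d, ctx_no) )
        return -ENOENT;

    ctx = iommu_get_context(d, ctx_no);

    /* Hardware programming is delegated to the driver. */
    ret = dom_iommu(d)->platform_ops->attach(d, dev, ctx);
    if ( ret )
        return ret;

    /* Device list management stays in platform-independent code. */
    list_add(&dev->context_list, &ctx->devices);

    return 0;
}
```
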
## Phantom function handling

PCI devices may use additional devfns to do DMA operations; in order to support
such devices, an interface is added to map specific device functions without
implying that the device is mapped to a new context (which may cause duplicates in
Xen data structures).

The add_devfn and remove_devfn functions allow mapping an IOMMU context on a
specific devfn of a PCI device, without altering platform-independent data
structures.

It is important for the reattach operation to take care of these devices, in order
to prevent a device from being partially reattached to the new context (see
XSA-449 [2]), by using an all-or-nothing approach for reattaching such devices.

# Quarantine refactoring using IOMMU contexts

The quarantine mechanism can be entirely reimplemented using IOMMU contexts,
making it simpler and more consistent between platforms.

Quarantine is currently only supported on x86 platforms and works by creating a
single *hardware IOMMU domain* per quarantined device. All the quarantine logic is
then implemented in a platform-specific fashion, while actually implementing the
same concepts:

The *hardware IOMMU context* data structures for quarantine are currently stored
in the device structure itself (using arch_pci_dev), and the IOMMU driver needs to
care about whether we are dealing with quarantine operations or regular operations
(often dealt with using macros such as QUARANTINE_SKIP or DEVICE_PGTABLE).

The page table that will apply to the quarantined device is created with the
reserved device regions, and mappings to a scratch page are added if enabled
(quarantine=scratch-page).

A new approach we can use is allowing the quarantine domain (DomIO) to manage
IOMMU contexts, and implementing all the quarantine logic using IOMMU contexts.

That way, the quarantine implementation can be platform-independent, and thus have
a more consistent implementation between platforms. It also allows quarantine to
work with other IOMMU implementations without having to implement platform-specific
behavior. Moreover, quarantine operations can be implemented using regular context
operations instead of relying on driver-specific code.

The quarantine implementation can be summarised as:

```c
int iommu_quarantine_dev_init(device_t *dev)
{
    int ret;
    u16 ctx_no;

    if ( !iommu_quarantine )
        return -EINVAL;

    ret = iommu_context_alloc(dom_io, &ctx_no, IOMMU_CONTEXT_INIT_quarantine);

    if ( ret )
        return ret;

    /** TODO: Setup scratch page, mappings... */

    ret = iommu_reattach_context(dev->domain, dom_io, dev, ctx_no);

    if ( ret )
    {
        int rc = iommu_context_free(dom_io, ctx_no, 0);

        ASSERT(!rc);
        return ret;
    }

    return ret;
}
```

# Platform-specific considerations

## Reference counters on target pages

When mapping a guest page onto an IOMMU context, we need to make sure that this
page is not reused for something else while it is actually referenced by an IOMMU
context. One way of doing this is to increment the reference counter of each
target page we map (excluding reserved regions), and to decrement it when the
mapping isn't used anymore.

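
A minimal sketch of that idea, assuming the usual get_page()/put_page() helpers
and the context-aware iommu_map() discussed earlier (reserved regions, flushing
and most error handling are deliberately ignored here):

```c
/*
 * Sketch only: take a reference on the target page when it enters a
 * context, and drop it again if the mapping cannot be established.
 */
static int context_map_with_ref(struct domain *d, struct iommu_context *ctx,
                                dfn_t dfn, mfn_t mfn, unsigned int flags,
                                unsigned int *flush_flags)
{
    struct page_info *page = mfn_to_page(mfn);
    int ret;

    /* Keep the page alive for as long as the IOMMU context references it. */
    if ( !get_page(page, d) )
        return -EINVAL;

    ret = iommu_map(d, dfn, mfn, 1, flags, flush_flags, ctx->id);
    if ( ret )
        put_page(page); /* mapping failed, drop the reference again */

    return ret;
}
```

The matching unmap path would call put_page() once the mapping (and any IOTLB
entry) is gone.
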
One consideration is destroying a context that still has existing mappings in it.
We can walk through the entire page table and decrement the reference counter of
all mappings. All of that assumes that there is no reserved region mapped (which
should be the case as a requirement of teardown, or as a consequence of the
REATTACH_DEFAULT flag).

Another consideration is that this "cleanup mappings" operation may take a lot of
time depending on the complexity of the page table. Making the teardown operation
preemptible allows the hypercall to be preempted if needed, and also prevents a
malicious guest from stalling a CPU in a teardown operation with a specially
crafted IOMMU context (e.g. with several 1G superpages).

## Limit the amount of pages IOMMU contexts can use

In order to prevent a (potentially malicious) guest from causing too many
allocations in Xen, we can enforce limits on the memory the IOMMU subsystem can
use for IOMMU contexts. A possible implementation is to preallocate a reasonably
large chunk of memory and split it into pages for use by the IOMMU subsystem only
for non-default IOMMU contexts (e.g. the PV-IOMMU interface); if this limit is
exceeded, some operations may fail from the guest side. These limitations
shouldn't impact "usual" operations of the IOMMU subsystem (e.g. default context
initialization).

## x86 Architecture

TODO

### Intel VT-d

VT-d uses the DID to tag the *IOMMU domain* applied to a device and assumes that
all entries with the same DID use the same page table (i.e. the same IOMMU
context). Under certain circumstances (e.g. a DRHD with a DID limit below
16 bits), the *DID* is transparently converted into a DRHD-specific DID using a
map managed internally.

The current implementation of the code reuses the Xen domain_id as the DID.
However, with multiple IOMMU contexts per domain, we can't use the domain_id for
contexts (otherwise, different page tables would be mapped with the same DID).
The following strategy is used:
- on the default context, reuse the domain_id (the default context is unique per
domain)
- on non-default contexts, use an id allocated in the pseudo_domid map (currently
used by quarantine), which is a DID outside of the Xen domain_id range

### AMD-Vi

TODO

## Device-tree platforms

### SMMU and SMMUv3

TODO

* * *

[1] See pv-iommu.md

[2] pci: phantom functions assigned to incorrect contexts
https://xenbits.xen.org/xsa/advisory-449.html

-- 
2.45.2

Teddy Astie | Vates XCP-ng Intern

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech