From: Teddy Astie
Subject: [RFC XEN PATCH 1/5] docs/designs: Add a design document for PV-IOMMU
To: xen-devel@lists.xenproject.org
Cc: Teddy Astie, Andrew Cooper, George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini, Marek Marczykowski-Górecki
Message-Id: <2aa4e20a1d7aeb51f393cde4d142732e18d3a57c.1718269097.git.teddy.astie@vates.tech>
Date: Thu, 13 Jun 2024 12:16:43 +0000

Some operating systems want to use the IOMMU to implement various features
(e.g. VFIO) or DMA protection.

This patch introduces a proposal for IOMMU paravirtualization for Dom0.

Signed-off-by: Teddy Astie
---
 docs/designs/pv-iommu.md | 105 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 105 insertions(+)
 create mode 100644 docs/designs/pv-iommu.md

diff --git a/docs/designs/pv-iommu.md b/docs/designs/pv-iommu.md
new file mode 100644
index 0000000000..c01062a3ad
--- /dev/null
+++ b/docs/designs/pv-iommu.md
@@ -0,0 +1,105 @@
+# IOMMU paravirtualization for Dom0
+
+Status: Experimental
+
+# Background
+
+By default, Xen only uses the IOMMU for itself, either to make device address
+space coherent with guest address space (x86 HVM/PVH) or to prevent devices
+from doing DMA outside their expected memory regions, including the hypervisor
+(x86 PV).
+
+A limitation is that guests (especially privileged ones) may want to use
+IOMMU hardware in order to implement features such as DMA protection and
+VFIO [1], as IOMMU functionality is currently not available outside of the
+hypervisor.
+
+[1] VFIO - "Virtual Function I/O" - https://www.kernel.org/doc/html/latest/driver-api/vfio.html
+
+# Design
+
+The operating system may want to have access to various IOMMU features such as
+context management and DMA remapping. We can create a new hypercall that allows
+the guest to have access to a new paravirtualized IOMMU interface.
+
+This feature is only meant to be available for Dom0. Since DomUs have some
+emulated devices that can't be managed on the Xen side and are not hardware,
+we can't rely on the hardware IOMMU to enforce DMA remapping for them.
+
+This interface is exposed under the `iommu_op` hypercall.
+
+In addition, Xen domains are modified in order to allow the existence of several
+IOMMU contexts, including a default one that implements the default behavior (e.g.
+hardware assisted paging) and can't be modified by the guest. DomUs cannot have
+additional contexts, and therefore act as if they only have the default context.
+
+Each IOMMU context within a Xen domain is identified using a domain-specific
+context number that is used in the Xen IOMMU subsystem and the hypercall
+interface.
+
+The number of IOMMU contexts a domain can use is predetermined at domain creation
+and is configurable through the `dom0-iommu=nb-ctx=N` Xen command line option.
+
+# IOMMU operations
+
+## Alloc context
+
+Create a new IOMMU context for the guest and return the context number to the
+guest.
+Fail if the IOMMU context limit of the guest is reached.
+
+A flag can be specified to create an identity mapping.
+
+## Free context
+
+Destroy an IOMMU context created previously.
+It is not possible to free the default context.
+
+Reattach the context's devices to the default context if specified by the guest.
+
+Fail if there is a device in the context and the reattach-to-default flag is not
+specified.
+
+## Reattach device
+
+Reattach a device to another IOMMU context (including the default one).
+The target IOMMU context number must be valid and the context allocated.
+
+The guest needs to specify the PCI SBDF of a device it has access to.
+
+## Map/unmap page
+
+Map/unmap a page on a context.
+The guest needs to specify a gfn and a target dfn to map.
+
+Refuse to create the mapping if one already exists for the same dfn.
+
+## Lookup page
+
+Get the gfn mapped by a specific dfn.
+
+# Implementation considerations
+
+## Hypercall batching
+
+In order to prevent unneeded hypercalls and IOMMU flushing, it is advisable to
+be able to batch some critical IOMMU operations (e.g. map/unmap multiple pages).
+
+## Hardware without IOMMU support
+
+The operating system needs to be aware of the PV-IOMMU capability, and whether
+it is able to create contexts. Some operating systems may critically fail if
+they are unable to create a new IOMMU context, which is what happens when no
+IOMMU hardware is available.
+
+The hypercall interface needs a way to advertise the ability to create
+and manage IOMMU contexts, including the number of contexts the guest is able
+to use. Using this information, Dom0 may decide whether or not to use
+the PV-IOMMU interface.
+
+## Page pool for contexts
+
+In order to prevent unexpected starvation of hypervisor memory by a
+buggy Dom0, we can preallocate the pages the contexts will use and make
+map/unmap use these pages instead of allocating them dynamically.
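To make the page-pool idea concrete, here is a minimal, self-contained sketch in
plain C with hypothetical names (an illustration only, not part of this patch):
map/unmap would take page-table pages from a pool reserved at domain creation and
fail once it is exhausted, instead of allocating more hypervisor memory on demand.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define POOL_PAGE_SIZE 4096UL

/* A fixed pool of pages reserved up front for one domain's IOMMU contexts. */
struct pgtable_pool {
    uint8_t *backing;        /* one contiguous chunk reserved at creation */
    unsigned long nr_pages;  /* total pages in the pool */
    unsigned long *bitmap;   /* one bit per page: set = in use */
};

static int pool_init(struct pgtable_pool *p, unsigned long nr_pages)
{
    p->backing = aligned_alloc(POOL_PAGE_SIZE, nr_pages * POOL_PAGE_SIZE);
    p->bitmap = calloc((nr_pages + 63) / 64, sizeof(*p->bitmap));
    if ( !p->backing || !p->bitmap )
        return -1;
    p->nr_pages = nr_pages;
    return 0;
}

/* Called by map/unmap when a new page-table page is needed. */
static void *pool_get_page(struct pgtable_pool *p)
{
    for ( unsigned long i = 0; i < p->nr_pages; i++ )
        if ( !(p->bitmap[i / 64] & (1UL << (i % 64))) )
        {
            p->bitmap[i / 64] |= 1UL << (i % 64);
            return memset(p->backing + i * POOL_PAGE_SIZE, 0, POOL_PAGE_SIZE);
        }
    return NULL; /* pool exhausted: the guest's map request fails */
}

/* Called when a page-table page becomes empty and can be recycled. */
static void pool_put_page(struct pgtable_pool *p, void *page)
{
    unsigned long i = ((uint8_t *)page - p->backing) / POOL_PAGE_SIZE;
    p->bitmap[i / 64] &= ~(1UL << (i % 64));
}
```

How many pages to reserve would naturally be tied to the `nb-ctx` limit described
above.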
+
-- 
2.45.2

Teddy Astie | Vates XCP-ng Intern

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech

From: Teddy Astie
Subject: [RFC XEN PATCH 2/5] docs/designs: Add a design document for IOMMU subsystem redesign
To: xen-devel@lists.xenproject.org
Cc: Teddy Astie, Andrew Cooper, George Dunlap, Jan Beulich, Julien Grall, Stefano Stabellini, Marek Marczykowski-Górecki
Date: Thu, 13 Jun 2024 12:16:45 +0000

The current IOMMU subsystem has some limitations that make PV-IOMMU practically
impossible. One of them is the assumption that each domain is bound to a single
"IOMMU domain", which also causes complications with the quarantine implementation.

Moreover, the current IOMMU subsystem is not entirely well-defined; for instance,
the behavior of map_page differs greatly between ARM SMMUv3 and x86 VT-d/AMD-Vi.
On ARM it can modify the domain page table, while on x86 it may be forbidden
(e.g. when using HAP with PVH), or may only modify the devices' point of view
(e.g. with PV).

The goal of this redesign is to define more explicitly the behavior and interface
of the IOMMU subsystem while allowing PV-IOMMU to be effectively implemented.
Signed-off-by: Teddy Astie
---
 docs/designs/iommu-contexts.md | 398 +++++++++++++++++++++++++++++++++
 1 file changed, 398 insertions(+)
 create mode 100644 docs/designs/iommu-contexts.md

diff --git a/docs/designs/iommu-contexts.md b/docs/designs/iommu-contexts.md
new file mode 100644
index 0000000000..41bc701bf2
--- /dev/null
+++ b/docs/designs/iommu-contexts.md
@@ -0,0 +1,398 @@
+# IOMMU context management in Xen
+
+Status: Experimental
+Revision: 0
+
+# Background
+
+The design for *IOMMU paravirtualization for Dom0* [1] explains that some guests may
+want access to IOMMU features. In order to implement this in Xen, several adjustments
+need to be made to the IOMMU subsystem.
+
+The *hardware IOMMU domain* is currently implemented on a per-domain basis: each
+domain has exactly one *hardware IOMMU domain*. This design aims to allow a
+single Xen domain to manage several "IOMMU contexts", and to allow some domains (e.g. Dom0
+[1]) to modify their IOMMU contexts.
+
+In addition to this, the quarantine feature can be refactored to use IOMMU contexts,
+reducing the complexity of platform-specific implementations and ensuring more
+consistency across platforms.
+
+# IOMMU context
+
+We define an "IOMMU context" as being a *hardware IOMMU domain*, named a context
+to avoid confusion with Xen domains.
+It represents some hardware-specific data structure that contains mappings from a device
+frame number to a machine frame number (e.g. using a page table) that can be applied to
+a device using IOMMU hardware.
+
+This structure is bound to a Xen domain, but a Xen domain may have several IOMMU contexts.
+These contexts may be modifiable using the interface defined in [1], aside from some
+specific cases (e.g. modifying the default context).
+
+This is implemented in Xen as a new structure that will hold context-specific
+data.
+
+```c
+struct iommu_context {
+    u16 id; /* Context id (0 means default context) */
+    struct list_head devices;
+
+    struct arch_iommu_context arch;
+
+    bool opaque; /* context can't be modified nor accessed (e.g HAP) */
+};
+```
+
+A context is identified by a number that is domain-specific and may be used by IOMMU
+users such as PV-IOMMU by the guest.
+
+struct arch_iommu_context is split out of struct arch_iommu:
+
+```c
+struct arch_iommu_context
+{
+    spinlock_t pgtables_lock;
+    struct page_list_head pgtables;
+
+    union {
+        /* Intel VT-d */
+        struct {
+            uint64_t pgd_maddr; /* io page directory machine address */
+            domid_t *didmap; /* per-iommu DID */
+            unsigned long *iommu_bitmap; /* bitmap of iommu(s) that the context uses */
+        } vtd;
+        /* AMD IOMMU */
+        struct {
+            struct page_info *root_table;
+        } amd;
+    };
+};
+
+struct arch_iommu
+{
+    spinlock_t mapping_lock; /* io page table lock */
+    struct {
+        struct page_list_head list;
+        spinlock_t lock;
+    } pgtables;
+
+    struct list_head identity_maps;
+
+    union {
+        /* Intel VT-d */
+        struct {
+            /* no more context-specific values */
+            unsigned int agaw; /* adjusted guest address width, 0 is level 2 30-bit */
+        } vtd;
+        /* AMD IOMMU */
+        struct {
+            unsigned int paging_mode;
+            struct guest_iommu *g_iommu;
+        } amd;
+    };
+};
+```
+
+IOMMU context information is now carried by iommu_context rather than being integrated
+into struct arch_iommu.
+
+# Xen domain IOMMU structure
+
+`struct domain_iommu` is modified to allow multiple contexts within a single Xen domain
+to exist:
+
+```c
+struct iommu_context_list {
+    uint16_t count; /* Context count excluding default context */
+
+    /* if count > 0 */
+
+    uint64_t *bitmap; /* bitmap of context allocation */
+    struct iommu_context *map; /* Map of contexts */
+};
+
+struct domain_iommu {
+    /* ... */
+
+    struct iommu_context default_ctx;
+    struct iommu_context_list other_contexts;
+
+    /* ... */
+}
+```
+
+default_ctx is a special context with id=0 that holds the page table mapping the entire
+domain, which basically preserves the previous behavior. All devices are expected to be
+bound to this context during initialization.
+
+Along with this default context that always exists, we use a pool of contexts with a
+fixed size chosen at domain initialization, where contexts can be allocated (if possible)
+and have an id matching their position in the map (considering that id != 0).
+These contexts may be used by IOMMU context users such as PV-IOMMU or the quarantine
+domain (DomIO).
+
+# Platform independent context management interface
+
+A new platform-independent interface is introduced in the Xen hypervisor to allow
+IOMMU context users to create and manage contexts within domains.
+
+```c
+/* Direct context access functions (not supposed to be used directly) */
+#define iommu_default_context(d) (&dom_iommu(d)->default_ctx)
+struct iommu_context *iommu_get_context(struct domain *d, u16 ctx_no);
+int iommu_context_init(struct domain *d, struct iommu_context *ctx, u16 ctx_no, u32 flags);
+int iommu_context_teardown(struct domain *d, struct iommu_context *ctx, u32 flags);
+
+/* Check if a specific context exists in the domain, note that ctx_no=0 always
+   exists */
+bool iommu_check_context(struct domain *d, u16 ctx_no);
+
+/* Flag for default context initialization */
+#define IOMMU_CONTEXT_INIT_default (1 << 0)
+
+/* Flag for quarantine contexts (scratch page, DMA Abort mode, ...) */
+#define IOMMU_CONTEXT_INIT_quarantine (1 << 1)
+
+/* Flag to specify that devices will need to be reattached to the default context */
+#define IOMMU_TEARDOWN_REATTACH_DEFAULT (1 << 0)
+
+/* Allocate a new context, uses CONTEXT_INIT flags */
+int iommu_context_alloc(struct domain *d, u16 *ctx_no, u32 flags);
+
+/* Free a context, uses CONTEXT_TEARDOWN flags */
+int iommu_context_free(struct domain *d, u16 ctx_no, u32 flags);
+
+/* Move a device from one context to another, including between different domains. */
+int iommu_reattach_context(struct domain *prev_dom, struct domain *next_dom,
+                           device_t *dev, u16 ctx_no);
+
+/* Add a device to a context for first initialization */
+int iommu_attach_context(struct domain *d, device_t *dev, u16 ctx_no);
+
+/* Remove a device from a context, effectively removing it from the IOMMU. */
+int iommu_dettach_context(struct domain *d, device_t *dev);
+```
+
+This interface will rely on a new interface with drivers to implement these features.
+
+Some existing functions will gain a new parameter to specify on which context to do the
+operation:
+- iommu_map (iommu_legacy_map untouched)
+- iommu_unmap (iommu_legacy_unmap untouched)
+- iommu_lookup_page
+- iommu_iotlb_flush
+
+These functions will modify the iommu_context structure to accommodate the
+operations applied; they will be used to replace some operations previously
+made in the IOMMU driver.
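To illustrate how a context user would drive this interface (a sketch only, not code
from this series; the prototypes are the ones listed above, with iommu_map and
iommu_iotlb_flush taking the extra context number as described), creating a context,
moving a device into it and establishing a mapping could look like:

```c
/* Sketch: create a context for domain d, move one device into it, map one page.
 * Error handling is reduced to early exits. */
static int pv_iommu_example(struct domain *d, device_t *dev, dfn_t dfn, mfn_t mfn)
{
    u16 ctx_no;
    unsigned int flush_flags = 0;
    int rc;

    /* New, empty context (no special INIT flag). */
    rc = iommu_context_alloc(d, &ctx_no, 0);
    if ( rc )
        return rc;

    /* Move the device from its current context (e.g. the default one, 0)
       into the new one; both contexts belong to the same domain here. */
    rc = iommu_reattach_context(d, d, dev, ctx_no);
    if ( rc )
        goto out_free;

    /* Map one page into the new context, then flush its IOTLB. */
    rc = iommu_map(d, dfn, mfn, 1, IOMMUF_readable | IOMMUF_writable,
                   &flush_flags, ctx_no);
    if ( !rc )
        rc = iommu_iotlb_flush(d, dfn, 1, flush_flags, ctx_no);
    if ( !rc )
        return 0;

 out_free:
    /* On failure, push devices back to the default context and free it. */
    iommu_context_free(d, ctx_no, IOMMU_TEARDOWN_REATTACH_DEFAULT);
    return rc;
}
```

On the error path, IOMMU_TEARDOWN_REATTACH_DEFAULT sends any device left in the
context back to context 0 before the context is freed, matching the
reattach-to-default behavior described in [1].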
+
+# IOMMU platform_ops interface changes
+
+The IOMMU driver needs to expose a way to create and manage IOMMU contexts. The approach
+taken here is to modify the interface to allow specifying an IOMMU context on operations,
+and at the same time, to simplify the interface by relying more on
+platform-independent iommu code.
+
+Added functions in iommu_ops
+
+```c
+/* Initialize a context (creating page tables, allocating hardware structures, ...) */
+int (*context_init)(struct domain *d, struct iommu_context *ctx,
+                    u32 flags);
+/* Destroy a context, assumes no device is bound to the context. */
+int (*context_teardown)(struct domain *d, struct iommu_context *ctx,
+                        u32 flags);
+/* Put a device in a context (assumes the device is not attached to another context) */
+int (*attach)(struct domain *d, device_t *dev,
+              struct iommu_context *ctx);
+/* Remove a device from a context, and from the IOMMU. */
+int (*dettach)(struct domain *d, device_t *dev,
+               struct iommu_context *prev_ctx);
+/* Move the device from a context to another, including if the new context is in
+   another domain. d corresponds to the target domain. */
+int (*reattach)(struct domain *d, device_t *dev,
+                struct iommu_context *prev_ctx,
+                struct iommu_context *ctx);
+
+#ifdef CONFIG_HAS_PCI
+/* Specific interface for phantom function devices. */
+int (*add_devfn)(struct domain *d, struct pci_dev *pdev, u16 devfn,
+                 struct iommu_context *ctx);
+int (*remove_devfn)(struct domain *d, struct pci_dev *pdev, u16 devfn,
+                    struct iommu_context *ctx);
+#endif
+
+/* Changes in existing ops to use a specified iommu_context. */
+int __must_check (*map_page)(struct domain *d, dfn_t dfn, mfn_t mfn,
+                             unsigned int flags,
+                             unsigned int *flush_flags,
+                             struct iommu_context *ctx);
+int __must_check (*unmap_page)(struct domain *d, dfn_t dfn,
+                               unsigned int order,
+                               unsigned int *flush_flags,
+                               struct iommu_context *ctx);
+int __must_check (*lookup_page)(struct domain *d, dfn_t dfn, mfn_t *mfn,
+                                unsigned int *flags,
+                                struct iommu_context *ctx);
+
+int __must_check (*iotlb_flush)(struct iommu_context *ctx, dfn_t dfn,
+                                unsigned long page_count,
+                                unsigned int flush_flags);
+
+void (*clear_root_pgtable)(struct domain *d, struct iommu_context *ctx);
+```
+
+These functions are redundant with existing functions; therefore, the following functions
+are replaced with new equivalents:
+- quarantine_init : platform-independent code and IOMMU_CONTEXT_INIT_quarantine flag
+- add_device : attach and add_devfn (phantom)
+- assign_device : attach and add_devfn (phantom)
+- remove_device : dettach and remove_devfn (phantom)
+- reassign_device : reattach
+
+There are some functional differences with the previous functions; the following should
+be handled by platform-independent/arch-specific code instead of the IOMMU driver:
+- identity mappings (unity mappings and rmrr)
+- device list in context and domain
+- domain of a device
+- quarantine
+
+The idea behind this is to implement IOMMU context features while simplifying IOMMU
+driver implementations and ensuring more consistency between IOMMU drivers.
+
+## Phantom function handling
+
+PCI devices may use additional devfns to do DMA operations. In order to support such
+devices, an interface is added to map specific device functions without implying that
+the device is mapped to a new context (that may cause duplicates in Xen data structures).
+
+The add_devfn and remove_devfn functions allow mapping an IOMMU context onto a specific
+devfn of a PCI device, without altering platform-independent data structures.
+
+It is important for the reattach operation to take care of these devices, in order
+to prevent devices from being partially reattached to the new context (see XSA-449 [2]),
+by using an all-or-nothing approach for reattaching such devices.
+
+# Quarantine refactoring using IOMMU contexts
+
+The quarantine mechanism can be entirely reimplemented using IOMMU contexts, making
+it simpler and more consistent between platforms.
+
+Quarantine is currently only supported on x86 platforms and works by creating a
+single *hardware IOMMU domain* per quarantined device. All the quarantine logic is
+implemented in a platform-specific fashion while actually implementing the same
+concepts:
+
+The *hardware IOMMU context* data structures for quarantine are currently stored in
+the device structure itself (using arch_pci_dev), and the IOMMU driver needs to care
+about whether we are dealing with quarantine operations or regular operations (often
+dealt with using macros such as QUARANTINE_SKIP or DEVICE_PGTABLE).
+
+The page table that will apply to the quarantined device is created from the reserved
+device regions, adding mappings to a scratch page if enabled (quarantine=scratch-page).
+
+A new approach we can use is allowing the quarantine domain (DomIO) to manage IOMMU
+contexts, and implementing all the quarantine logic using IOMMU contexts.
+
+That way, the quarantine implementation can be platform-independent and thus have a more
+consistent implementation between platforms. It will also allow quarantine to work
+with other IOMMU implementations without having to implement platform-specific behavior.
+Moreover, quarantine operations can be implemented using regular context operations
+instead of relying on driver-specific code.
+
+The quarantine implementation can be summarised as
+
+```c
+int iommu_quarantine_dev_init(device_t *dev)
+{
+    int ret;
+    u16 ctx_no;
+
+    if ( !iommu_quarantine )
+        return -EINVAL;
+
+    ret = iommu_context_alloc(dom_io, &ctx_no, IOMMU_CONTEXT_INIT_quarantine);
+
+    if ( ret )
+        return ret;
+
+    /** TODO: Setup scratch page, mappings... */
+
+    ret = iommu_reattach_context(dev->domain, dom_io, dev, ctx_no);
+
+    if ( ret )
+    {
+        ASSERT(!iommu_context_free(dom_io, ctx_no, 0));
+        return ret;
+    }
+
+    return ret;
+}
+```
+
+# Platform-specific considerations
+
+## Reference counters on target pages
+
+When mapping a guest page onto an IOMMU context, we need to make sure that
+this page is not reused for something else while being actually referenced
+by an IOMMU context. One way of doing it is incrementing the reference counter
+of each target page we map (excluding reserved regions), and decrementing it
+when the mapping isn't used anymore.
+
+One consideration to have is when destroying the context while having existing
+mappings in it. We can walk through the entire page table and decrement the
+reference counter of all mappings. All of that assumes that there is no reserved
+region mapped (which should be the case as a requirement of teardown, or as a
+consequence of the REATTACH_DEFAULT flag).
+
+Another consideration is that the "cleanup mappings" operation may take a lot
+of time depending on the complexity of the page table.
+Making the teardown operation preemptable allows the hypercall to be preempted if
+needed, and also prevents a malicious guest from stalling a CPU in a teardown
+operation with a specially crafted IOMMU context (e.g. with several 1G superpages).
+
+## Limit the amount of pages IOMMU contexts can use
+
+In order to prevent a (possibly malicious) guest from causing too many allocations
+in Xen, we can enforce limits on the memory the IOMMU subsystem can use for IOMMU
+contexts. A possible implementation is to preallocate a reasonably large chunk of
+memory and split it into pages for use by the IOMMU subsystem, only for non-default
+IOMMU contexts (e.g. the PV-IOMMU interface); if this limit is reached, some
+operations may fail from the guest side. These limitations shouldn't impact "usual"
+operations of the IOMMU subsystem (e.g. default context initialization).
+
+## x86 Architecture
+
+TODO
+
+### Intel VT-d
+
+VT-d uses the DID to tag the *IOMMU domain* applied to a device and assumes that all
+entries with the same DID use the same page table (i.e. the same IOMMU context).
+Under certain circumstances (e.g. a DRHD with a DID limit below 16 bits), the *DID* is
+transparently converted into a DRHD-specific DID using a map managed internally.
+
+The current implementation of the code reuses the Xen domain_id as the DID.
+However, by using multiple IOMMU contexts per domain, we can't use the domain_id for
+contexts (otherwise, different page tables would be mapped with the same DID).
+The following strategy is used:
+- on the default context, reuse the domain_id (the default context is unique per domain)
+- on non-default contexts, use an id allocated in the pseudo_domid map (already used by
+quarantine), which is a DID outside of the Xen domain_id range
+
+### AMD-Vi
+
+TODO
+
+## Device-tree platforms
+
+### SMMU and SMMUv3
+
+TODO
+
+* * *
+
+[1] See pv-iommu.md
+
+[2] pci: phantom functions assigned to incorrect contexts
+https://xenbits.xen.org/xsa/advisory-449.html
\ No newline at end of file
-- 
2.45.2

Teddy Astie | Vates XCP-ng Intern

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech

From: Teddy Astie
Subject: [RFC XEN PATCH 3/5] IOMMU: Introduce redesigned IOMMU subsystem
To: xen-devel@lists.xenproject.org
Cc: Teddy Astie, Jan Beulich, Andrew Cooper, Roger Pau Monné, George Dunlap, Julien Grall, Stefano Stabellini, Lukasz Hawrylko, "Daniel P. Smith", Mateusz Mówka, Marek Marczykowski-Górecki
Message-Id: <99d93c1a8100c0d20d40d80c0e94f46f906a986b.1718269097.git.teddy.astie@vates.tech>
Date: Thu, 13 Jun 2024 12:16:50 +0000

Based on docs/designs/iommu-contexts.md, implement the redesigned IOMMU subsystem.

Signed-off-by: Teddy Astie
---
Missing in this RFC:

Quarantine implementation is incomplete.
Automatic determination of max ctx_no (maximum IOMMU context count) based on
PCI device count.
Automatic determination of max ctx_no (for dom_io).
Empty/no default IOMMU context mode (UEFI IOMMU based boot).
Support for DomU (and configuration using e.g. libxl).
---
 xen/arch/x86/domain.c                |   2 +-
 xen/arch/x86/mm/p2m-ept.c            |   2 +-
 xen/arch/x86/pv/dom0_build.c         |   4 +-
 xen/arch/x86/tboot.c                 |   4 +-
 xen/common/memory.c                  |   4 +-
 xen/drivers/passthrough/Makefile     |   3 +
 xen/drivers/passthrough/context.c    | 626 +++++++++++++++++++++++++++
 xen/drivers/passthrough/iommu.c      | 333 ++++----------
 xen/drivers/passthrough/pci.c        |  49 ++-
 xen/drivers/passthrough/quarantine.c |  49 +++
 xen/include/xen/iommu.h              | 118 ++++-
 xen/include/xen/pci.h                |   3 +
 12 files changed, 897 insertions(+), 300 deletions(-)
 create mode 100644 xen/drivers/passthrough/context.c
 create mode 100644 xen/drivers/passthrough/quarantine.c

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 00a3aaa576..52de634c81 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -2381,7 +2381,7 @@ int domain_relinquish_resources(struct domain *d)
 
     PROGRESS(iommu_pagetables):
 
-        ret = iommu_free_pgtables(d);
+        ret = iommu_free_pgtables(d, iommu_default_context(d));
         if ( ret )
             return ret;
 
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index f83610cb8c..94c3631818 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -970,7 +970,7 @@ out:
         rc = iommu_iotlb_flush(d, _dfn(gfn), 1ul << order,
                                (iommu_flags ? IOMMU_FLUSHF_added : 0) |
                                (vtd_pte_present ? IOMMU_FLUSHF_modified
-                                                : 0));
+                                                : 0), 0);
     else if ( need_iommu_pt_sync(d) )
         rc = iommu_flags ?
             iommu_legacy_map(d, _dfn(gfn), mfn, 1ul << order, iommu_flags) :
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index d8043fa58a..db7298737d 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -76,7 +76,7 @@ static __init void mark_pv_pt_pages_rdonly(struct domain *d,
      * iommu_memory_setup() ended up mapping them.
*/ if ( need_iommu_pt_sync(d) && - iommu_unmap(d, _dfn(mfn_x(page_to_mfn(page))), 1, 0, flush_fl= ags) ) + iommu_unmap(d, _dfn(mfn_x(page_to_mfn(page))), 1, 0, flush_fl= ags, 0) ) BUG(); =20 /* Read-only mapping + PGC_allocated + page-table page. */ @@ -127,7 +127,7 @@ static void __init iommu_memory_setup(struct domain *d,= const char *what, =20 while ( (rc =3D iommu_map(d, _dfn(mfn_x(mfn)), mfn, nr, IOMMUF_readable | IOMMUF_writable | IOMMUF_pre= empt, - flush_flags)) > 0 ) + flush_flags, 0)) > 0 ) { mfn =3D mfn_add(mfn, rc); nr -=3D rc; diff --git a/xen/arch/x86/tboot.c b/xen/arch/x86/tboot.c index ba0700d2d5..ca55306830 100644 --- a/xen/arch/x86/tboot.c +++ b/xen/arch/x86/tboot.c @@ -216,9 +216,9 @@ static void tboot_gen_domain_integrity(const uint8_t ke= y[TB_KEY_SIZE], =20 if ( is_iommu_enabled(d) && is_vtd ) { - const struct domain_iommu *dio =3D dom_iommu(d); + struct domain_iommu *dio =3D dom_iommu(d); =20 - update_iommu_mac(&ctx, dio->arch.vtd.pgd_maddr, + update_iommu_mac(&ctx, iommu_default_context(d)->arch.vtd.pgd_= maddr, agaw_to_level(dio->arch.vtd.agaw)); } } diff --git a/xen/common/memory.c b/xen/common/memory.c index de2cc7ad92..0eb0f9da7b 100644 --- a/xen/common/memory.c +++ b/xen/common/memory.c @@ -925,7 +925,7 @@ int xenmem_add_to_physmap(struct domain *d, struct xen_= add_to_physmap *xatp, this_cpu(iommu_dont_flush_iotlb) =3D 0; =20 ret =3D iommu_iotlb_flush(d, _dfn(xatp->idx - done), done, - IOMMU_FLUSHF_modified); + IOMMU_FLUSHF_modified, 0); if ( unlikely(ret) && rc >=3D 0 ) rc =3D ret; =20 @@ -939,7 +939,7 @@ int xenmem_add_to_physmap(struct domain *d, struct xen_= add_to_physmap *xatp, put_page(pages[i]); =20 ret =3D iommu_iotlb_flush(d, _dfn(xatp->gpfn - done), done, - IOMMU_FLUSHF_added | IOMMU_FLUSHF_modified= ); + IOMMU_FLUSHF_added | IOMMU_FLUSHF_modified= , 0); if ( unlikely(ret) && rc >=3D 0 ) rc =3D ret; } diff --git a/xen/drivers/passthrough/Makefile b/xen/drivers/passthrough/Mak= efile index a1621540b7..69327080ab 100644 --- a/xen/drivers/passthrough/Makefile +++ b/xen/drivers/passthrough/Makefile @@ -4,6 +4,9 @@ obj-$(CONFIG_X86) +=3D x86/ obj-$(CONFIG_ARM) +=3D arm/ =20 obj-y +=3D iommu.o +obj-y +=3D context.o +obj-y +=3D quarantine.o + obj-$(CONFIG_HAS_PCI) +=3D pci.o obj-$(CONFIG_HAS_DEVICE_TREE) +=3D device_tree.o obj-$(CONFIG_HAS_PCI) +=3D ats.o diff --git a/xen/drivers/passthrough/context.c b/xen/drivers/passthrough/co= ntext.c new file mode 100644 index 0000000000..3cc7697164 --- /dev/null +++ b/xen/drivers/passthrough/context.c @@ -0,0 +1,626 @@ +/* + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License f= or + * more details. + * + * You should have received a copy of the GNU General Public License along= with + * this program; If not, see . + */ + +#include +#include +#include +#include +#include +#include + +bool iommu_check_context(struct domain *d, u16 ctx_no) { + struct domain_iommu *hd =3D dom_iommu(d); + + if (ctx_no =3D=3D 0) + return 1; /* Default context always exist. 
*/ + + if ((ctx_no - 1) >=3D hd->other_contexts.count) + return 0; /* out of bounds */ + + return test_bit(ctx_no - 1, hd->other_contexts.bitmap); +} + +struct iommu_context *iommu_get_context(struct domain *d, u16 ctx_no) { + struct domain_iommu *hd =3D dom_iommu(d); + + if (!iommu_check_context(d, ctx_no)) + return NULL; + + if (ctx_no =3D=3D 0) + return &hd->default_ctx; + else + return &hd->other_contexts.map[ctx_no - 1]; +} + +static unsigned int mapping_order(const struct domain_iommu *hd, + dfn_t dfn, mfn_t mfn, unsigned long nr) +{ + unsigned long res =3D dfn_x(dfn) | mfn_x(mfn); + unsigned long sizes =3D hd->platform_ops->page_sizes; + unsigned int bit =3D find_first_set_bit(sizes), order =3D 0; + + ASSERT(bit =3D=3D PAGE_SHIFT); + + while ( (sizes =3D (sizes >> bit) & ~1) ) + { + unsigned long mask; + + bit =3D find_first_set_bit(sizes); + mask =3D (1UL << bit) - 1; + if ( nr <=3D mask || (res & mask) ) + break; + order +=3D bit; + nr >>=3D bit; + res >>=3D bit; + } + + return order; +} + +long _iommu_map(struct domain *d, dfn_t dfn0, mfn_t mfn0, + unsigned long page_count, unsigned int flags, + unsigned int *flush_flags, u16 ctx_no) +{ + struct domain_iommu *hd =3D dom_iommu(d); + unsigned long i; + unsigned int order, j =3D 0; + int rc =3D 0; + + if ( !is_iommu_enabled(d) ) + return 0; + + if (!iommu_check_context(d, ctx_no)) + return -ENOENT; + + ASSERT(!IOMMUF_order(flags)); + + for ( i =3D 0; i < page_count; i +=3D 1UL << order ) + { + dfn_t dfn =3D dfn_add(dfn0, i); + mfn_t mfn =3D mfn_add(mfn0, i); + + order =3D mapping_order(hd, dfn, mfn, page_count - i); + + if ( (flags & IOMMUF_preempt) && + ((!(++j & 0xfff) && general_preempt_check()) || + i > LONG_MAX - (1UL << order)) ) + return i; + + rc =3D iommu_call(hd->platform_ops, map_page, d, dfn, mfn, + flags | IOMMUF_order(order), flush_flags, + iommu_get_context(d, ctx_no)); + + if ( likely(!rc) ) + continue; + + if ( !d->is_shutting_down && printk_ratelimit() ) + printk(XENLOG_ERR + "d%d: IOMMU mapping dfn %"PRI_dfn" to mfn %"PRI_mfn" fa= iled: %d\n", + d->domain_id, dfn_x(dfn), mfn_x(mfn), rc); + + /* while statement to satisfy __must_check */ + while ( _iommu_unmap(d, dfn0, i, 0, flush_flags, ctx_no) ) + break; + + if ( !ctx_no && !is_hardware_domain(d) ) + domain_crash(d); + + break; + } + + /* + * Something went wrong so, if we were dealing with more than a single + * page, flush everything and clear flush flags. 
+ */ + if ( page_count > 1 && unlikely(rc) && + !iommu_iotlb_flush_all(d, *flush_flags) ) + *flush_flags =3D 0; + + return rc; +} + +long iommu_map(struct domain *d, dfn_t dfn0, mfn_t mfn0, + unsigned long page_count, unsigned int flags, + unsigned int *flush_flags, u16 ctx_no) +{ + struct domain_iommu *hd =3D dom_iommu(d); + long ret; + + spin_lock(&hd->lock); + ret =3D _iommu_map(d, dfn0, mfn0, page_count, flags, flush_flags, ctx_= no); + spin_unlock(&hd->lock); + + return ret; +} + +int iommu_legacy_map(struct domain *d, dfn_t dfn, mfn_t mfn, + unsigned long page_count, unsigned int flags) +{ + struct domain_iommu *hd =3D dom_iommu(d); + unsigned int flush_flags =3D 0; + int rc; + + ASSERT(!(flags & IOMMUF_preempt)); + + spin_lock(&hd->lock); + rc =3D _iommu_map(d, dfn, mfn, page_count, flags, &flush_flags, 0); + + if ( !this_cpu(iommu_dont_flush_iotlb) && !rc ) + rc =3D _iommu_iotlb_flush(d, dfn, page_count, flush_flags, 0); + spin_unlock(&hd->lock); + + return rc; +} + +long iommu_unmap(struct domain *d, dfn_t dfn0, unsigned long page_count, + unsigned int flags, unsigned int *flush_flags, + u16 ctx_no) +{ + struct domain_iommu *hd =3D dom_iommu(d); + long ret; + + spin_lock(&hd->lock); + ret =3D _iommu_unmap(d, dfn0, page_count, flags, flush_flags, ctx_no); + spin_unlock(&hd->lock); + + return ret; +} + +long _iommu_unmap(struct domain *d, dfn_t dfn0, unsigned long page_count, + unsigned int flags, unsigned int *flush_flags, + u16 ctx_no) +{ + struct domain_iommu *hd =3D dom_iommu(d); + unsigned long i; + unsigned int order, j =3D 0; + int rc =3D 0; + + if ( !is_iommu_enabled(d) ) + return 0; + + if ( !iommu_check_context(d, ctx_no) ) + return -ENOENT; + + ASSERT(!(flags & ~IOMMUF_preempt)); + + for ( i =3D 0; i < page_count; i +=3D 1UL << order ) + { + dfn_t dfn =3D dfn_add(dfn0, i); + int err; + + order =3D mapping_order(hd, dfn, _mfn(0), page_count - i); + + if ( (flags & IOMMUF_preempt) && + ((!(++j & 0xfff) && general_preempt_check()) || + i > LONG_MAX - (1UL << order)) ) + return i; + + err =3D iommu_call(hd->platform_ops, unmap_page, d, dfn, + flags | IOMMUF_order(order), flush_flags, + iommu_get_context(d, ctx_no)); + + if ( likely(!err) ) + continue; + + if ( !d->is_shutting_down && printk_ratelimit() ) + printk(XENLOG_ERR + "d%d: IOMMU unmapping dfn %"PRI_dfn" failed: %d\n", + d->domain_id, dfn_x(dfn), err); + + if ( !rc ) + rc =3D err; + + if ( !is_hardware_domain(d) ) + { + domain_crash(d); + break; + } + } + + /* + * Something went wrong so, if we were dealing with more than a single + * page, flush everything and clear flush flags. 
+ */ + if ( page_count > 1 && unlikely(rc) && + !iommu_iotlb_flush_all(d, *flush_flags) ) + *flush_flags =3D 0; + + return rc; +} + +int iommu_legacy_unmap(struct domain *d, dfn_t dfn, unsigned long page_cou= nt) +{ + unsigned int flush_flags =3D 0; + struct domain_iommu *hd =3D dom_iommu(d); + int rc; + + spin_lock(&hd->lock); + rc =3D _iommu_unmap(d, dfn, page_count, 0, &flush_flags, 0); + + if ( !this_cpu(iommu_dont_flush_iotlb) && !rc ) + rc =3D _iommu_iotlb_flush(d, dfn, page_count, flush_flags, 0); + spin_unlock(&hd->lock); + + return rc; +} + +int iommu_lookup_page(struct domain *d, dfn_t dfn, mfn_t *mfn, + unsigned int *flags, u16 ctx_no) +{ + struct domain_iommu *hd =3D dom_iommu(d); + int ret; + + if ( !is_iommu_enabled(d) || !hd->platform_ops->lookup_page ) + return -EOPNOTSUPP; + + if (!iommu_check_context(d, ctx_no)) + return -ENOENT; + + spin_lock(&hd->lock); + ret =3D iommu_call(hd->platform_ops, lookup_page, d, dfn, mfn, flags, = iommu_get_context(d, ctx_no)); + spin_unlock(&hd->lock); + + return ret; +} + +int _iommu_iotlb_flush(struct domain *d, dfn_t dfn, unsigned long page_cou= nt, + unsigned int flush_flags, u16 ctx_no) +{ + struct domain_iommu *hd =3D dom_iommu(d); + int rc; + + if ( !is_iommu_enabled(d) || !hd->platform_ops->iotlb_flush || + !page_count || !flush_flags ) + return 0; + + if ( dfn_eq(dfn, INVALID_DFN) ) + return -EINVAL; + + if ( !iommu_check_context(d, ctx_no) ) { + spin_unlock(&hd->lock); + return -ENOENT; + } + + rc =3D iommu_call(hd->platform_ops, iotlb_flush, d, iommu_get_context(= d, ctx_no), + dfn, page_count, flush_flags); + if ( unlikely(rc) ) + { + if ( !d->is_shutting_down && printk_ratelimit() ) + printk(XENLOG_ERR + "d%d: IOMMU IOTLB flush failed: %d, dfn %"PRI_dfn", pag= e count %lu flags %x\n", + d->domain_id, rc, dfn_x(dfn), page_count, flush_flags); + + if ( !is_hardware_domain(d) ) + domain_crash(d); + } + + return rc; +} + +int iommu_iotlb_flush(struct domain *d, dfn_t dfn, unsigned long page_coun= t, + unsigned int flush_flags, u16 ctx_no) +{ + struct domain_iommu *hd =3D dom_iommu(d); + int ret; + + spin_lock(&hd->lock); + ret =3D _iommu_iotlb_flush(d, dfn, page_count, flush_flags, ctx_no); + spin_unlock(&hd->lock); + + return ret; +} + +int iommu_context_init(struct domain *d, struct iommu_context *ctx, u16 ct= x_no, u32 flags) +{ + if ( !dom_iommu(d)->platform_ops->context_init ) + return -ENOSYS; + + INIT_LIST_HEAD(&ctx->devices); + ctx->id =3D ctx_no; + ctx->dying =3D false; + + return iommu_call(dom_iommu(d)->platform_ops, context_init, d, ctx, fl= ags); +} + +int iommu_context_alloc(struct domain *d, u16 *ctx_no, u32 flags) +{ + unsigned int i; + int ret; + struct domain_iommu *hd =3D dom_iommu(d); + + spin_lock(&hd->lock); + + /* TODO: use TSL instead ? 
*/ + i =3D find_first_zero_bit(hd->other_contexts.bitmap, hd->other_context= s.count); + + if ( i < hd->other_contexts.count ) + set_bit(i, hd->other_contexts.bitmap); + + if ( i >=3D hd->other_contexts.count ) /* no free context */ + return -ENOSPC; + + *ctx_no =3D i + 1; + + ret =3D iommu_context_init(d, iommu_get_context(d, *ctx_no), *ctx_no, = flags); + + if ( ret ) + __clear_bit(*ctx_no, hd->other_contexts.bitmap); + + spin_unlock(&hd->lock); + + return ret; +} + +int _iommu_attach_context(struct domain *d, device_t *dev, u16 ctx_no) +{ + struct iommu_context *ctx; + int ret; + + pcidevs_lock(); + + if ( !iommu_check_context(d, ctx_no) ) + { + ret =3D -ENOENT; + goto unlock; + } + + ctx =3D iommu_get_context(d, ctx_no); + + if ( ctx->dying ) + { + ret =3D -EINVAL; + goto unlock; + } + + ret =3D iommu_call(dom_iommu(d)->platform_ops, attach, d, dev, ctx); + + if ( !ret ) + { + dev->context =3D ctx_no; + list_add(&dev->context_list, &ctx->devices); + } + +unlock: + pcidevs_unlock(); + return ret; +} + +int iommu_attach_context(struct domain *d, device_t *dev, u16 ctx_no) +{ + struct domain_iommu *hd =3D dom_iommu(d); + int ret; + + spin_lock(&hd->lock); + ret =3D _iommu_attach_context(d, dev, ctx_no); + spin_unlock(&hd->lock); + + return ret; +} + +int _iommu_dettach_context(struct domain *d, device_t *dev) +{ + struct iommu_context *ctx; + int ret; + + if (!dev->domain) + { + printk("IOMMU: Trying to dettach a non-attached device."); + WARN(); + return 0; + } + + /* Make sure device is actually in the domain. */ + ASSERT(d =3D=3D dev->domain); + + pcidevs_lock(); + + ctx =3D iommu_get_context(d, dev->context); + ASSERT(ctx); /* device is using an invalid context ? + dev->context invalid ? */ + + ret =3D iommu_call(dom_iommu(d)->platform_ops, dettach, d, dev, ctx); + + if ( !ret ) + { + list_del(&dev->context_list); + + /** TODO: Do we need to remove the device from domain ? + * Reattaching to something (quarantine, hardware domain ?) + */ + + /* + * rcu_lock_domain ? + * list_del(&dev->domain_list); + * dev->domain =3D ?; + */ + } + + pcidevs_unlock(); + return ret; +} + +int iommu_dettach_context(struct domain *d, device_t *dev) +{ + int ret; + struct domain_iommu *hd =3D dom_iommu(d); + + spin_lock(&hd->lock); + ret =3D _iommu_dettach_context(d, dev); + spin_unlock(&hd->lock); + + return ret; +} + +int _iommu_reattach_context(struct domain *prev_dom, struct domain *next_d= om, + device_t *dev, u16 ctx_no) +{ + struct domain_iommu *hd; + u16 prev_ctx_no; + device_t *ctx_dev; + struct iommu_context *prev_ctx, *next_ctx; + int ret; + bool same_domain; + + /* Make sure we actually are doing something meaningful */ + BUG_ON(!prev_dom && !next_dom); + + /// TODO: Do such cases exists ? 
+ // /* Platform ops must match */ + // if (dom_iommu(prev_dom)->platform_ops !=3D dom_iommu(next_dom)->pla= tform_ops) + // return -EINVAL; + + pcidevs_lock(); + + if (!prev_dom) + return _iommu_attach_context(next_dom, dev, ctx_no); + + if (!next_dom) + return _iommu_dettach_context(prev_dom, dev); + + hd =3D dom_iommu(prev_dom); + same_domain =3D prev_dom =3D=3D next_dom; + + prev_ctx_no =3D dev->context; + + if ( !same_domain && (ctx_no =3D=3D prev_ctx_no) ) + { + printk(XENLOG_DEBUG "Reattaching %pp to same IOMMU context c%hu\n"= , &dev, ctx_no); + ret =3D 0; + goto unlock; + } + + if ( !iommu_check_context(next_dom, ctx_no) ) + { + ret =3D -ENOENT; + goto unlock; + } + + prev_ctx =3D iommu_get_context(prev_dom, prev_ctx_no); + next_ctx =3D iommu_get_context(next_dom, ctx_no); + + if ( next_ctx->dying ) + { + ret =3D -EINVAL; + goto unlock; + } + + ret =3D iommu_call(hd->platform_ops, reattach, next_dom, dev, prev_ctx, + next_ctx); + + if ( ret ) + goto unlock; + + /* Remove device from previous context, and add it to new one. */ + list_for_each_entry(ctx_dev, &prev_ctx->devices, context_list) + { + if ( ctx_dev =3D=3D dev ) + { + list_del(&ctx_dev->context_list); + list_add(&ctx_dev->context_list, &next_ctx->devices); + break; + } + } + + if ( !same_domain ) + { + /* Update domain pci devices accordingly */ + + /** TODO: should be done here or elsewhere ? */ + } + + if (!ret) + dev->context =3D ctx_no; /* update device context*/ + +unlock: + pcidevs_unlock(); + return ret; +} + +int iommu_reattach_context(struct domain *prev_dom, struct domain *next_do= m, + device_t *dev, u16 ctx_no) +{ + int ret; + struct domain_iommu *prev_hd =3D dom_iommu(prev_dom); + struct domain_iommu *next_hd =3D dom_iommu(next_dom); + + spin_lock(&prev_hd->lock); + + if (prev_dom !=3D next_dom) + spin_lock(&next_hd->lock); + + ret =3D _iommu_reattach_context(prev_dom, next_dom, dev, ctx_no); + + spin_unlock(&prev_hd->lock); + + if (prev_dom !=3D next_dom) + spin_unlock(&next_hd->lock); + + return ret; +} + +int _iommu_context_teardown(struct domain *d, struct iommu_context *ctx, u= 32 flags) +{ + struct domain_iommu *hd =3D dom_iommu(d); + + if ( !dom_iommu(d)->platform_ops->context_teardown ) + return -ENOSYS; + + ctx->dying =3D true; + + /* first reattach devices back to default context if needed */ + if ( flags & IOMMU_TEARDOWN_REATTACH_DEFAULT ) + { + struct pci_dev *device; + list_for_each_entry(device, &ctx->devices, context_list) + _iommu_reattach_context(d, d, device, 0); + } + else if (!list_empty(&ctx->devices)) + return -EBUSY; /* there is a device in context */ + + return iommu_call(hd->platform_ops, context_teardown, d, ctx, flags); +} + +int iommu_context_teardown(struct domain *d, struct iommu_context *ctx, u3= 2 flags) +{ + struct domain_iommu *hd =3D dom_iommu(d); + int ret; + + spin_lock(&hd->lock); + ret =3D _iommu_context_teardown(d, ctx, flags); + spin_unlock(&hd->lock); + + return ret; +} + +int iommu_context_free(struct domain *d, u16 ctx_no, u32 flags) +{ + int ret; + struct domain_iommu *hd =3D dom_iommu(d); + + if ( ctx_no =3D=3D 0 ) + return -EINVAL; + + spin_lock(&hd->lock); + if ( !iommu_check_context(d, ctx_no) ) + return -ENOENT; + + ret =3D _iommu_context_teardown(d, iommu_get_context(d, ctx_no), flags= ); + + if ( !ret ) + clear_bit(ctx_no - 1, hd->other_contexts.bitmap); + + spin_unlock(&hd->lock); + + return ret; +} diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iomm= u.c index ba18136c46..a9e2a8a49b 100644 --- a/xen/drivers/passthrough/iommu.c +++ 
b/xen/drivers/passthrough/iommu.c @@ -12,6 +12,7 @@ * this program; If not, see . */ =20 +#include #include #include #include @@ -21,6 +22,10 @@ #include #include #include +#include +#include +#include +#include =20 #ifdef CONFIG_X86 #include @@ -35,22 +40,6 @@ bool __read_mostly force_iommu; bool __read_mostly iommu_verbose; static bool __read_mostly iommu_crash_disable; =20 -#define IOMMU_quarantine_none 0 /* aka false */ -#define IOMMU_quarantine_basic 1 /* aka true */ -#define IOMMU_quarantine_scratch_page 2 -#ifdef CONFIG_HAS_PCI -uint8_t __read_mostly iommu_quarantine =3D -# if defined(CONFIG_IOMMU_QUARANTINE_NONE) - IOMMU_quarantine_none; -# elif defined(CONFIG_IOMMU_QUARANTINE_BASIC) - IOMMU_quarantine_basic; -# elif defined(CONFIG_IOMMU_QUARANTINE_SCRATCH_PAGE) - IOMMU_quarantine_scratch_page; -# endif -#else -# define iommu_quarantine IOMMU_quarantine_none -#endif /* CONFIG_HAS_PCI */ - static bool __hwdom_initdata iommu_hwdom_none; bool __hwdom_initdata iommu_hwdom_strict; bool __read_mostly iommu_hwdom_passthrough; @@ -61,6 +50,13 @@ int8_t __hwdom_initdata iommu_hwdom_reserved =3D -1; bool __read_mostly iommu_hap_pt_share =3D true; #endif =20 +uint16_t __read_mostly iommu_hwdom_nb_ctx =3D 8; +bool __read_mostly iommu_hwdom_nb_ctx_forced =3D false; + +#ifdef CONFIG_X86 +unsigned int __read_mostly iommu_hwdom_arena_order =3D CONFIG_X86_ARENA_OR= DER; +#endif + bool __read_mostly iommu_debug; =20 DEFINE_PER_CPU(bool, iommu_dont_flush_iotlb); @@ -156,6 +152,7 @@ static int __init cf_check parse_dom0_iommu_param(const= char *s) int rc =3D 0; =20 do { + long long ll_val; int val; =20 ss =3D strchr(s, ','); @@ -172,6 +169,20 @@ static int __init cf_check parse_dom0_iommu_param(cons= t char *s) iommu_hwdom_reserved =3D val; else if ( !cmdline_strcmp(s, "none") ) iommu_hwdom_none =3D true; + else if ( !parse_signed_integer("nb-ctx", s, ss, &ll_val) ) + { + if (ll_val > 0 && ll_val < UINT16_MAX) + iommu_hwdom_nb_ctx =3D ll_val; + else + printk(XENLOG_WARNING "'nb-ctx=3D%lld' value out of range!= \n", ll_val); + } + else if ( !parse_signed_integer("arena-order", s, ss, &ll_val) ) + { + if (ll_val > 0) + iommu_hwdom_arena_order =3D ll_val; + else + printk(XENLOG_WARNING "'arena-order=3D%lld' value out of r= ange!\n", ll_val); + } else rc =3D -EINVAL; =20 @@ -193,9 +204,26 @@ static void __hwdom_init check_hwdom_reqs(struct domai= n *d) arch_iommu_check_autotranslated_hwdom(d); } =20 +uint16_t __hwdom_init iommu_hwdom_ctx_count(void) +{ + if (iommu_hwdom_nb_ctx_forced) + return iommu_hwdom_nb_ctx; + + /* TODO: Find a proper way of counting devices ? 
*/ + return 256; + + /* + if (iommu_hwdom_nb_ctx !=3D UINT16_MAX) + iommu_hwdom_nb_ctx++; + else + printk(XENLOG_WARNING " IOMMU: Can't prepare more contexts: too mu= ch devices"); + */ +} + int iommu_domain_init(struct domain *d, unsigned int opts) { struct domain_iommu *hd =3D dom_iommu(d); + uint16_t other_context_count; int ret =3D 0; =20 if ( is_hardware_domain(d) ) @@ -236,6 +264,37 @@ int iommu_domain_init(struct domain *d, unsigned int o= pts) =20 ASSERT(!(hd->need_sync && hd->hap_pt_share)); =20 + iommu_hwdom_nb_ctx =3D iommu_hwdom_ctx_count(); + + if ( is_hardware_domain(d) ) + { + BUG_ON(iommu_hwdom_nb_ctx =3D=3D 0); /* sanity check (prevent unde= rflow) */ + printk(XENLOG_INFO "Dom0 uses %lu IOMMU contexts\n", + (unsigned long)iommu_hwdom_nb_ctx); + hd->other_contexts.count =3D iommu_hwdom_nb_ctx - 1; + } + else if ( d =3D=3D dom_io ) + { + /* TODO: Determine count differently */ + hd->other_contexts.count =3D 128; + } + else + hd->other_contexts.count =3D 0; + + other_context_count =3D hd->other_contexts.count; + if (other_context_count > 0) { + /* Initialize context bitmap */ + hd->other_contexts.bitmap =3D xzalloc_array(unsigned long, + BITS_TO_LONGS(other_cont= ext_count)); + hd->other_contexts.map =3D xzalloc_array(struct iommu_context, + other_context_count); + } else { + hd->other_contexts.bitmap =3D NULL; + hd->other_contexts.map =3D NULL; + } + + iommu_context_init(d, &hd->default_ctx, 0, IOMMU_CONTEXT_INIT_default); + return 0; } =20 @@ -249,13 +308,12 @@ static void cf_check iommu_dump_page_tables(unsigned = char key) =20 for_each_domain(d) { - if ( is_hardware_domain(d) || !is_iommu_enabled(d) ) + if ( !is_iommu_enabled(d) ) continue; =20 if ( iommu_use_hap_pt(d) ) { printk("%pd sharing page tables\n", d); - continue; } =20 iommu_vcall(dom_iommu(d)->platform_ops, dump_page_tables, d); @@ -276,10 +334,13 @@ void __hwdom_init iommu_hwdom_init(struct domain *d) iommu_vcall(hd->platform_ops, hwdom_init, d); } =20 -static void iommu_teardown(struct domain *d) +void iommu_domain_destroy(struct domain *d) { struct domain_iommu *hd =3D dom_iommu(d); =20 + if ( !is_iommu_enabled(d) ) + return; + /* * During early domain creation failure, we may reach here with the * ops not yet initialized. 
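
The context table set up in iommu_domain_init() above keeps context 0 as the built-in default context and tracks the extra contexts with a bitmap plus a flat array, which is why iommu_context_alloc() hands back i + 1 and iommu_context_free() clears bit ctx_no - 1. A minimal stand-alone model of that numbering scheme (plain C, not part of the patch; names are illustrative):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative model: ctx 0 is the default context, ctx N (N > 0)
     * lives in slot N - 1 of the "other contexts" bitmap, mirroring
     * iommu_context_alloc()/iommu_context_free() above. */
    struct ctx_table {
        uint16_t count;          /* number of non-default contexts */
        unsigned char *bitmap;   /* one bit per non-default context */
    };

    static int ctx_alloc(struct ctx_table *t, uint16_t *ctx_no)
    {
        uint16_t i;

        for ( i = 0; i < t->count; i++ )
            if ( !(t->bitmap[i / 8] & (1u << (i % 8))) )
            {
                t->bitmap[i / 8] |= 1u << (i % 8);
                *ctx_no = i + 1;     /* 0 is reserved for the default context */
                return 0;
            }

        return -1;                   /* no free context (-ENOSPC above) */
    }

    static int ctx_free(struct ctx_table *t, uint16_t ctx_no)
    {
        if ( ctx_no == 0 || ctx_no > t->count )
            return -1;               /* the default context cannot be freed */

        t->bitmap[(ctx_no - 1) / 8] &= ~(1u << ((ctx_no - 1) % 8));
        return 0;
    }

    int main(void)
    {
        struct ctx_table t = { .count = 8 };
        uint16_t ctx;

        t.bitmap = calloc(1, (t.count + 7) / 8);
        if ( t.bitmap && !ctx_alloc(&t, &ctx) )
            printf("allocated context %u\n", ctx);   /* prints 1 */
        ctx_free(&t, ctx);
        free(t.bitmap);
        return 0;
    }

With the parsing added above, both knobs can be given on the hypervisor command line in the usual comma-separated form, e.g. dom0-iommu=nb-ctx=32,arena-order=11; out-of-range values are reported and ignored, as in the hunks above.
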
@@ -288,224 +349,10 @@ static void iommu_teardown(struct domain *d) return; =20 iommu_vcall(hd->platform_ops, teardown, d); -} - -void iommu_domain_destroy(struct domain *d) -{ - if ( !is_iommu_enabled(d) ) - return; - - iommu_teardown(d); =20 arch_iommu_domain_destroy(d); } =20 -static unsigned int mapping_order(const struct domain_iommu *hd, - dfn_t dfn, mfn_t mfn, unsigned long nr) -{ - unsigned long res =3D dfn_x(dfn) | mfn_x(mfn); - unsigned long sizes =3D hd->platform_ops->page_sizes; - unsigned int bit =3D find_first_set_bit(sizes), order =3D 0; - - ASSERT(bit =3D=3D PAGE_SHIFT); - - while ( (sizes =3D (sizes >> bit) & ~1) ) - { - unsigned long mask; - - bit =3D find_first_set_bit(sizes); - mask =3D (1UL << bit) - 1; - if ( nr <=3D mask || (res & mask) ) - break; - order +=3D bit; - nr >>=3D bit; - res >>=3D bit; - } - - return order; -} - -long iommu_map(struct domain *d, dfn_t dfn0, mfn_t mfn0, - unsigned long page_count, unsigned int flags, - unsigned int *flush_flags) -{ - const struct domain_iommu *hd =3D dom_iommu(d); - unsigned long i; - unsigned int order, j =3D 0; - int rc =3D 0; - - if ( !is_iommu_enabled(d) ) - return 0; - - ASSERT(!IOMMUF_order(flags)); - - for ( i =3D 0; i < page_count; i +=3D 1UL << order ) - { - dfn_t dfn =3D dfn_add(dfn0, i); - mfn_t mfn =3D mfn_add(mfn0, i); - - order =3D mapping_order(hd, dfn, mfn, page_count - i); - - if ( (flags & IOMMUF_preempt) && - ((!(++j & 0xfff) && general_preempt_check()) || - i > LONG_MAX - (1UL << order)) ) - return i; - - rc =3D iommu_call(hd->platform_ops, map_page, d, dfn, mfn, - flags | IOMMUF_order(order), flush_flags); - - if ( likely(!rc) ) - continue; - - if ( !d->is_shutting_down && printk_ratelimit() ) - printk(XENLOG_ERR - "d%d: IOMMU mapping dfn %"PRI_dfn" to mfn %"PRI_mfn" fa= iled: %d\n", - d->domain_id, dfn_x(dfn), mfn_x(mfn), rc); - - /* while statement to satisfy __must_check */ - while ( iommu_unmap(d, dfn0, i, 0, flush_flags) ) - break; - - if ( !is_hardware_domain(d) ) - domain_crash(d); - - break; - } - - /* - * Something went wrong so, if we were dealing with more than a single - * page, flush everything and clear flush flags. 
- */ - if ( page_count > 1 && unlikely(rc) && - !iommu_iotlb_flush_all(d, *flush_flags) ) - *flush_flags =3D 0; - - return rc; -} - -int iommu_legacy_map(struct domain *d, dfn_t dfn, mfn_t mfn, - unsigned long page_count, unsigned int flags) -{ - unsigned int flush_flags =3D 0; - int rc; - - ASSERT(!(flags & IOMMUF_preempt)); - rc =3D iommu_map(d, dfn, mfn, page_count, flags, &flush_flags); - - if ( !this_cpu(iommu_dont_flush_iotlb) && !rc ) - rc =3D iommu_iotlb_flush(d, dfn, page_count, flush_flags); - - return rc; -} - -long iommu_unmap(struct domain *d, dfn_t dfn0, unsigned long page_count, - unsigned int flags, unsigned int *flush_flags) -{ - const struct domain_iommu *hd =3D dom_iommu(d); - unsigned long i; - unsigned int order, j =3D 0; - int rc =3D 0; - - if ( !is_iommu_enabled(d) ) - return 0; - - ASSERT(!(flags & ~IOMMUF_preempt)); - - for ( i =3D 0; i < page_count; i +=3D 1UL << order ) - { - dfn_t dfn =3D dfn_add(dfn0, i); - int err; - - order =3D mapping_order(hd, dfn, _mfn(0), page_count - i); - - if ( (flags & IOMMUF_preempt) && - ((!(++j & 0xfff) && general_preempt_check()) || - i > LONG_MAX - (1UL << order)) ) - return i; - - err =3D iommu_call(hd->platform_ops, unmap_page, d, dfn, - flags | IOMMUF_order(order), flush_flags); - - if ( likely(!err) ) - continue; - - if ( !d->is_shutting_down && printk_ratelimit() ) - printk(XENLOG_ERR - "d%d: IOMMU unmapping dfn %"PRI_dfn" failed: %d\n", - d->domain_id, dfn_x(dfn), err); - - if ( !rc ) - rc =3D err; - - if ( !is_hardware_domain(d) ) - { - domain_crash(d); - break; - } - } - - /* - * Something went wrong so, if we were dealing with more than a single - * page, flush everything and clear flush flags. - */ - if ( page_count > 1 && unlikely(rc) && - !iommu_iotlb_flush_all(d, *flush_flags) ) - *flush_flags =3D 0; - - return rc; -} - -int iommu_legacy_unmap(struct domain *d, dfn_t dfn, unsigned long page_cou= nt) -{ - unsigned int flush_flags =3D 0; - int rc =3D iommu_unmap(d, dfn, page_count, 0, &flush_flags); - - if ( !this_cpu(iommu_dont_flush_iotlb) && !rc ) - rc =3D iommu_iotlb_flush(d, dfn, page_count, flush_flags); - - return rc; -} - -int iommu_lookup_page(struct domain *d, dfn_t dfn, mfn_t *mfn, - unsigned int *flags) -{ - const struct domain_iommu *hd =3D dom_iommu(d); - - if ( !is_iommu_enabled(d) || !hd->platform_ops->lookup_page ) - return -EOPNOTSUPP; - - return iommu_call(hd->platform_ops, lookup_page, d, dfn, mfn, flags); -} - -int iommu_iotlb_flush(struct domain *d, dfn_t dfn, unsigned long page_coun= t, - unsigned int flush_flags) -{ - const struct domain_iommu *hd =3D dom_iommu(d); - int rc; - - if ( !is_iommu_enabled(d) || !hd->platform_ops->iotlb_flush || - !page_count || !flush_flags ) - return 0; - - if ( dfn_eq(dfn, INVALID_DFN) ) - return -EINVAL; - - rc =3D iommu_call(hd->platform_ops, iotlb_flush, d, dfn, page_count, - flush_flags); - if ( unlikely(rc) ) - { - if ( !d->is_shutting_down && printk_ratelimit() ) - printk(XENLOG_ERR - "d%d: IOMMU IOTLB flush failed: %d, dfn %"PRI_dfn", pag= e count %lu flags %x\n", - d->domain_id, rc, dfn_x(dfn), page_count, flush_flags); - - if ( !is_hardware_domain(d) ) - domain_crash(d); - } - - return rc; -} - int iommu_iotlb_flush_all(struct domain *d, unsigned int flush_flags) { const struct domain_iommu *hd =3D dom_iommu(d); @@ -515,7 +362,7 @@ int iommu_iotlb_flush_all(struct domain *d, unsigned in= t flush_flags) !flush_flags ) return 0; =20 - rc =3D iommu_call(hd->platform_ops, iotlb_flush, d, INVALID_DFN, 0, + rc =3D iommu_call(hd->platform_ops, iotlb_flush, d, 
NULL, INVALID_DFN,= 0, flush_flags | IOMMU_FLUSHF_all); if ( unlikely(rc) ) { @@ -531,24 +378,6 @@ int iommu_iotlb_flush_all(struct domain *d, unsigned i= nt flush_flags) return rc; } =20 -int iommu_quarantine_dev_init(device_t *dev) -{ - const struct domain_iommu *hd =3D dom_iommu(dom_io); - - if ( !iommu_quarantine || !hd->platform_ops->quarantine_init ) - return 0; - - return iommu_call(hd->platform_ops, quarantine_init, - dev, iommu_quarantine =3D=3D IOMMU_quarantine_scratc= h_page); -} - -static int __init iommu_quarantine_init(void) -{ - dom_io->options |=3D XEN_DOMCTL_CDF_iommu; - - return iommu_domain_init(dom_io, 0); -} - int __init iommu_setup(void) { int rc =3D -ENODEV; diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c index 5a446d3dce..46c8a01801 100644 --- a/xen/drivers/passthrough/pci.c +++ b/xen/drivers/passthrough/pci.c @@ -1,6 +1,6 @@ /* * Copyright (C) 2008, Netronome Systems, Inc. - * =20 + * * This program is free software; you can redistribute it and/or modify it * under the terms and conditions of the GNU General Public License, * version 2, as published by the Free Software Foundation. @@ -286,14 +286,14 @@ static void apply_quirks(struct pci_dev *pdev) * Device [8086:2fc0] * Erratum HSE43 * CONFIG_TDP_NOMINAL CSR Implemented at Incorrect Offset - * http://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-= v3-spec-update.html=20 + * http://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-= v3-spec-update.html */ { PCI_VENDOR_ID_INTEL, 0x2fc0 }, /* * Devices [8086:6f60,6fa0,6fc0] * Errata BDF2 / BDX2 * PCI BARs in the Home Agent Will Return Non-Zero Values During E= numeration - * http://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-= v4-spec-update.html=20 + * http://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-= v4-spec-update.html */ { PCI_VENDOR_ID_INTEL, 0x6f60 }, { PCI_VENDOR_ID_INTEL, 0x6fa0 }, @@ -870,8 +870,8 @@ static int deassign_device(struct domain *d, uint16_t s= eg, uint8_t bus, devfn +=3D pdev->phantom_stride; if ( PCI_SLOT(devfn) !=3D PCI_SLOT(pdev->devfn) ) break; - ret =3D iommu_call(hd->platform_ops, reassign_device, d, target, d= evfn, - pci_to_dev(pdev)); + ret =3D iommu_call(hd->platform_ops, add_devfn, d, pci_to_dev(pdev= ), devfn, + &target->iommu.default_ctx); if ( ret ) goto out; } @@ -880,9 +880,9 @@ static int deassign_device(struct domain *d, uint16_t s= eg, uint8_t bus, vpci_deassign_device(pdev); write_unlock(&d->pci_lock); =20 - devfn =3D pdev->devfn; - ret =3D iommu_call(hd->platform_ops, reassign_device, d, target, devfn, - pci_to_dev(pdev)); + ret =3D iommu_call(hd->platform_ops, reattach, target, pci_to_dev(pdev= ), + iommu_get_context(d, pdev->context), + iommu_default_context(target)); if ( ret ) goto out; =20 @@ -890,6 +890,7 @@ static int deassign_device(struct domain *d, uint16_t s= eg, uint8_t bus, pdev->quarantine =3D false; =20 pdev->fault.count =3D 0; + pdev->domain =3D target; =20 write_lock(&target->pci_lock); /* Re-assign back to hardware_domain */ @@ -1329,12 +1330,7 @@ static int cf_check _dump_pci_devices(struct pci_seg= *pseg, void *arg) list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list ) { printk("%pp - ", &pdev->sbdf); -#ifdef CONFIG_X86 - if ( pdev->domain =3D=3D dom_io ) - printk("DomIO:%x", pdev->arch.pseudo_domid); - else -#endif - printk("%pd", pdev->domain); + printk("%pd", pdev->domain); printk(" - node %-3d", (pdev->node !=3D NUMA_NO_NODE) ? 
pdev->node= : -1); pdev_dump_msi(pdev); printk("\n"); @@ -1373,7 +1369,7 @@ static int iommu_add_device(struct pci_dev *pdev) if ( !is_iommu_enabled(pdev->domain) ) return 0; =20 - rc =3D iommu_call(hd->platform_ops, add_device, devfn, pci_to_dev(pdev= )); + rc =3D iommu_attach_context(pdev->domain, pci_to_dev(pdev), 0); if ( rc || !pdev->phantom_stride ) return rc; =20 @@ -1382,7 +1378,9 @@ static int iommu_add_device(struct pci_dev *pdev) devfn +=3D pdev->phantom_stride; if ( PCI_SLOT(devfn) !=3D PCI_SLOT(pdev->devfn) ) return 0; - rc =3D iommu_call(hd->platform_ops, add_device, devfn, pci_to_dev(= pdev)); + + rc =3D iommu_call(hd->platform_ops, add_devfn, pdev->domain, pdev,= devfn, + iommu_default_context(pdev->domain)); if ( rc ) printk(XENLOG_WARNING "IOMMU: add %pp failed (%d)\n", &PCI_SBDF(pdev->seg, pdev->bus, devfn), rc); @@ -1409,6 +1407,7 @@ static int iommu_enable_device(struct pci_dev *pdev) static int iommu_remove_device(struct pci_dev *pdev) { const struct domain_iommu *hd; + struct iommu_context *ctx; u8 devfn; =20 if ( !pdev->domain ) @@ -1418,6 +1417,10 @@ static int iommu_remove_device(struct pci_dev *pdev) if ( !is_iommu_enabled(pdev->domain) ) return 0; =20 + ctx =3D iommu_get_context(pdev->domain, pdev->context); + if ( !ctx ) + return -EINVAL; + for ( devfn =3D pdev->devfn ; pdev->phantom_stride; ) { int rc; @@ -1425,8 +1428,8 @@ static int iommu_remove_device(struct pci_dev *pdev) devfn +=3D pdev->phantom_stride; if ( PCI_SLOT(devfn) !=3D PCI_SLOT(pdev->devfn) ) break; - rc =3D iommu_call(hd->platform_ops, remove_device, devfn, - pci_to_dev(pdev)); + rc =3D iommu_call(hd->platform_ops, remove_devfn, pdev->domain, pd= ev, + devfn, ctx); if ( !rc ) continue; =20 @@ -1437,7 +1440,7 @@ static int iommu_remove_device(struct pci_dev *pdev) =20 devfn =3D pdev->devfn; =20 - return iommu_call(hd->platform_ops, remove_device, devfn, pci_to_dev(p= dev)); + return iommu_call(hd->platform_ops, dettach, pdev->domain, pdev, ctx); } =20 static int device_assigned(u16 seg, u8 bus, u8 devfn) @@ -1497,22 +1500,22 @@ static int assign_device(struct domain *d, u16 seg,= u8 bus, u8 devfn, u32 flag) if ( pdev->domain !=3D dom_io ) { rc =3D iommu_quarantine_dev_init(pci_to_dev(pdev)); + /** TODO: Consider phantom functions */ if ( rc ) goto done; } =20 pdev->fault.count =3D 0; =20 - rc =3D iommu_call(hd->platform_ops, assign_device, d, devfn, pci_to_de= v(pdev), - flag); + iommu_attach_context(d, pci_to_dev(pdev), 0); =20 while ( pdev->phantom_stride && !rc ) { devfn +=3D pdev->phantom_stride; if ( PCI_SLOT(devfn) !=3D PCI_SLOT(pdev->devfn) ) break; - rc =3D iommu_call(hd->platform_ops, assign_device, d, devfn, - pci_to_dev(pdev), flag); + rc =3D iommu_call(hd->platform_ops, add_devfn, d, pci_to_dev(pdev), + devfn, iommu_default_context(d)); } =20 if ( rc ) diff --git a/xen/drivers/passthrough/quarantine.c b/xen/drivers/passthrough= /quarantine.c new file mode 100644 index 0000000000..b58f136ad8 --- /dev/null +++ b/xen/drivers/passthrough/quarantine.c @@ -0,0 +1,49 @@ +#include +#include +#include + +#ifdef CONFIG_HAS_PCI +uint8_t __read_mostly iommu_quarantine =3D +# if defined(CONFIG_IOMMU_QUARANTINE_NONE) + IOMMU_quarantine_none; +# elif defined(CONFIG_IOMMU_QUARANTINE_BASIC) + IOMMU_quarantine_basic; +# elif defined(CONFIG_IOMMU_QUARANTINE_SCRATCH_PAGE) + IOMMU_quarantine_scratch_page; +# endif +#else +# define iommu_quarantine IOMMU_quarantine_none +#endif /* CONFIG_HAS_PCI */ + +int iommu_quarantine_dev_init(device_t *dev) +{ + int ret; + u16 ctx_no; + + if ( !iommu_quarantine ) + 
return 0; + + ret =3D iommu_context_alloc(dom_io, &ctx_no, IOMMU_CONTEXT_INIT_quaran= tine); + + if ( ret ) + return ret; + + /** TODO: Setup scratch page, mappings... */ + + ret =3D iommu_reattach_context(dev->domain, dom_io, dev, ctx_no); + + if ( ret ) + { + ASSERT(!iommu_context_free(dom_io, ctx_no, 0)); + return ret; + } + + return ret; +} + +int __init iommu_quarantine_init(void) +{ + dom_io->options |=3D XEN_DOMCTL_CDF_iommu; + + return iommu_domain_init(dom_io, 0); +} diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h index 442ae5322d..41b0e50827 100644 --- a/xen/include/xen/iommu.h +++ b/xen/include/xen/iommu.h @@ -52,7 +52,11 @@ static inline bool dfn_eq(dfn_t x, dfn_t y) #ifdef CONFIG_HAS_PASSTHROUGH extern bool iommu_enable, iommu_enabled; extern bool force_iommu, iommu_verbose; + /* Boolean except for the specific purposes of drivers/passthrough/iommu.c= . */ +#define IOMMU_quarantine_none 0 /* aka false */ +#define IOMMU_quarantine_basic 1 /* aka true */ +#define IOMMU_quarantine_scratch_page 2 extern uint8_t iommu_quarantine; #else #define iommu_enabled false @@ -107,6 +111,11 @@ extern bool amd_iommu_perdev_intremap; =20 extern bool iommu_hwdom_strict, iommu_hwdom_passthrough, iommu_hwdom_inclu= sive; extern int8_t iommu_hwdom_reserved; +extern uint16_t iommu_hwdom_nb_ctx; + +#ifdef CONFIG_X86 +extern unsigned int iommu_hwdom_arena_order; +#endif =20 extern unsigned int iommu_dev_iotlb_timeout; =20 @@ -161,11 +170,16 @@ enum */ long __must_check iommu_map(struct domain *d, dfn_t dfn0, mfn_t mfn0, unsigned long page_count, unsigned int flags, - unsigned int *flush_flags); + unsigned int *flush_flags, u16 ctx_no); +long __must_check _iommu_map(struct domain *d, dfn_t dfn0, mfn_t mfn0, + unsigned long page_count, unsigned int flags, + unsigned int *flush_flags, u16 ctx_no); long __must_check iommu_unmap(struct domain *d, dfn_t dfn0, unsigned long page_count, unsigned int flags, - unsigned int *flush_flags); - + unsigned int *flush_flags, u16 ctx_no); +long __must_check _iommu_unmap(struct domain *d, dfn_t dfn0, + unsigned long page_count, unsigned int flag= s, + unsigned int *flush_flags, u16 ctx_no); int __must_check iommu_legacy_map(struct domain *d, dfn_t dfn, mfn_t mfn, unsigned long page_count, unsigned int flags); @@ -173,11 +187,16 @@ int __must_check iommu_legacy_unmap(struct domain *d,= dfn_t dfn, unsigned long page_count); =20 int __must_check iommu_lookup_page(struct domain *d, dfn_t dfn, mfn_t *mfn, - unsigned int *flags); + unsigned int *flags, u16 ctx_no); =20 int __must_check iommu_iotlb_flush(struct domain *d, dfn_t dfn, unsigned long page_count, - unsigned int flush_flags); + unsigned int flush_flags, + u16 ctx_no); +int __must_check _iommu_iotlb_flush(struct domain *d, dfn_t dfn, + unsigned long page_count, + unsigned int flush_flags, + u16 ctx_no); int __must_check iommu_iotlb_flush_all(struct domain *d, unsigned int flush_flags); =20 @@ -250,20 +269,31 @@ struct page_info; */ typedef int iommu_grdm_t(xen_pfn_t start, xen_ulong_t nr, u32 id, void *ct= xt); =20 +struct iommu_context; + struct iommu_ops { unsigned long page_sizes; int (*init)(struct domain *d); void (*hwdom_init)(struct domain *d); - int (*quarantine_init)(device_t *dev, bool scratch_page); - int (*add_device)(uint8_t devfn, device_t *dev); + int (*context_init)(struct domain *d, struct iommu_context *ctx, + u32 flags); + int (*context_teardown)(struct domain *d, struct iommu_context *ctx, + u32 flags); + int (*attach)(struct domain *d, device_t *dev, + struct iommu_context *ctx); 
+ int (*dettach)(struct domain *d, device_t *dev, + struct iommu_context *prev_ctx); + int (*reattach)(struct domain *d, device_t *dev, + struct iommu_context *prev_ctx, + struct iommu_context *ctx); + int (*enable_device)(device_t *dev); - int (*remove_device)(uint8_t devfn, device_t *dev); - int (*assign_device)(struct domain *d, uint8_t devfn, device_t *dev, - uint32_t flag); - int (*reassign_device)(struct domain *s, struct domain *t, - uint8_t devfn, device_t *dev); #ifdef CONFIG_HAS_PCI int (*get_device_group_id)(uint16_t seg, uint8_t bus, uint8_t devfn); + int (*add_devfn)(struct domain *d, struct pci_dev *pdev, u16 devfn, + struct iommu_context *ctx); + int (*remove_devfn)(struct domain *d, struct pci_dev *pdev, u16 devfn, + struct iommu_context *ctx); #endif /* HAS_PCI */ =20 void (*teardown)(struct domain *d); @@ -274,12 +304,15 @@ struct iommu_ops { */ int __must_check (*map_page)(struct domain *d, dfn_t dfn, mfn_t mfn, unsigned int flags, - unsigned int *flush_flags); + unsigned int *flush_flags, + struct iommu_context *ctx); int __must_check (*unmap_page)(struct domain *d, dfn_t dfn, unsigned int order, - unsigned int *flush_flags); + unsigned int *flush_flags, + struct iommu_context *ctx); int __must_check (*lookup_page)(struct domain *d, dfn_t dfn, mfn_t *mf= n, - unsigned int *flags); + unsigned int *flags, + struct iommu_context *ctx); =20 #ifdef CONFIG_X86 int (*enable_x2apic)(void); @@ -292,14 +325,15 @@ struct iommu_ops { int (*setup_hpet_msi)(struct msi_desc *msi_desc); =20 void (*adjust_irq_affinities)(void); - void (*clear_root_pgtable)(struct domain *d); + void (*clear_root_pgtable)(struct domain *d, struct iommu_context *ctx= ); int (*update_ire_from_msi)(struct msi_desc *msi_desc, struct msi_msg *= msg); #endif /* CONFIG_X86 */ =20 int __must_check (*suspend)(void); void (*resume)(void); void (*crash_shutdown)(void); - int __must_check (*iotlb_flush)(struct domain *d, dfn_t dfn, + int __must_check (*iotlb_flush)(struct domain *d, + struct iommu_context *ctx, dfn_t dfn, unsigned long page_count, unsigned int flush_flags); int (*get_reserved_device_memory)(iommu_grdm_t *func, void *ctxt); @@ -343,11 +377,36 @@ extern int iommu_get_extra_reserved_device_memory(iom= mu_grdm_t *func, # define iommu_vcall iommu_call #endif =20 +struct iommu_context { + u16 id; /* Context id (0 means default context) */ + struct list_head devices; + + struct arch_iommu_context arch; + + bool opaque; /* context can't be modified nor accessed (e.g HAP) */ + bool dying; /* the context is tearing down */ +}; + +struct iommu_context_list { + uint16_t count; /* Context count excluding default context */ + + /* if count > 0 */ + + uint64_t *bitmap; /* bitmap of context allocation */ + struct iommu_context *map; /* Map of contexts */ +}; + + struct domain_iommu { + spinlock_t lock; /* iommu lock */ + #ifdef CONFIG_HAS_PASSTHROUGH struct arch_iommu arch; #endif =20 + struct iommu_context default_ctx; + struct iommu_context_list other_contexts; + /* iommu_ops */ const struct iommu_ops *platform_ops; =20 @@ -380,6 +439,7 @@ struct domain_iommu { #define dom_iommu(d) (&(d)->iommu) #define iommu_set_feature(d, f) set_bit(f, dom_iommu(d)->features) #define iommu_clear_feature(d, f) clear_bit(f, dom_iommu(d)->features) +#define iommu_default_context(d) (&dom_iommu(d)->default_ctx) =20 /* Are we using the domain P2M table as its IOMMU pagetable? 
*/ #define iommu_use_hap_pt(d) (IS_ENABLED(CONFIG_HVM) && \ @@ -405,6 +465,8 @@ int __must_check iommu_suspend(void); void iommu_resume(void); void iommu_crash_shutdown(void); int iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt); + +int __init iommu_quarantine_init(void); int iommu_quarantine_dev_init(device_t *dev); =20 #ifdef CONFIG_HAS_PCI @@ -414,6 +476,28 @@ int iommu_do_pci_domctl(struct xen_domctl *domctl, str= uct domain *d, =20 void iommu_dev_iotlb_flush_timeout(struct domain *d, struct pci_dev *pdev); =20 +struct iommu_context *iommu_get_context(struct domain *d, u16 ctx_no); +bool iommu_check_context(struct domain *d, u16 ctx_no); + +#define IOMMU_CONTEXT_INIT_default (1 << 0) +#define IOMMU_CONTEXT_INIT_quarantine (1 << 1) +int iommu_context_init(struct domain *d, struct iommu_context *ctx, u16 ct= x_no, u32 flags); + +#define IOMMU_TEARDOWN_REATTACH_DEFAULT (1 << 0) +#define IOMMU_TEARDOWN_PREEMPT (1 << 1) +int iommu_context_teardown(struct domain *d, struct iommu_context *ctx, u3= 2 flags); + +int iommu_context_alloc(struct domain *d, u16 *ctx_no, u32 flags); +int iommu_context_free(struct domain *d, u16 ctx_no, u32 flags); + +int iommu_reattach_context(struct domain *prev_dom, struct domain *next_do= m, + device_t *dev, u16 ctx_no); +int iommu_attach_context(struct domain *d, device_t *dev, u16 ctx_no); +int iommu_dettach_context(struct domain *d, device_t *dev); + +int _iommu_attach_context(struct domain *d, device_t *dev, u16 ctx_no); +int _iommu_dettach_context(struct domain *d, device_t *dev); + /* * The purpose of the iommu_dont_flush_iotlb optional cpu flag is to * avoid unecessary iotlb_flush in the low level IOMMU code. diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h index 63e49f0117..d6d4aaa6a5 100644 --- a/xen/include/xen/pci.h +++ b/xen/include/xen/pci.h @@ -97,6 +97,7 @@ struct pci_dev_info { struct pci_dev { struct list_head alldevs_list; struct list_head domain_list; + struct list_head context_list; =20 struct list_head msi_list; =20 @@ -104,6 +105,8 @@ struct pci_dev { =20 struct domain *domain; =20 + uint16_t context; /* IOMMU context number of domain */ + const union { struct { uint8_t devfn; --=20 2.45.2 Teddy Astie | Vates XCP-ng Intern XCP-ng & Xen Orchestra - Vates solutions web: https://vates.tech From nobody Mon Nov 25 02:36:33 2024 Delivered-To: importer@patchew.org Received-SPF: pass (zohomail.com: domain of lists.xenproject.org designates 192.237.175.120 as permitted sender) client-ip=192.237.175.120; envelope-from=xen-devel-bounces@lists.xenproject.org; helo=lists.xenproject.org; Authentication-Results: mx.zohomail.com; dkim=pass header.i=teddy.astie@vates.tech; spf=pass (zohomail.com: domain of lists.xenproject.org designates 192.237.175.120 as permitted sender) smtp.mailfrom=xen-devel-bounces@lists.xenproject.org; dmarc=pass(p=quarantine dis=none) header.from=vates.tech ARC-Seal: i=1; a=rsa-sha256; t=1718281037; cv=none; d=zohomail.com; s=zohoarc; b=GKyliNsgnYIoKz/M18ivCGX0TQtftZRW5O3j0nR0oYlX/LaE5++s4bi4pF0an82NSqVhM/PVohii7GmxMZLwqCbDXgcKdr+U5i2XV0gRHRJh9yQpTk0FBGjmxJjWWc6NejcTKqynIPGGemlqaAU0uoWREhqUVt6wNsSjN5fNN5k= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=zohomail.com; s=zohoarc; t=1718281037; h=Content-Type:Content-Transfer-Encoding:Cc:Cc:Date:Date:From:From:In-Reply-To:List-Subscribe:List-Post:List-Id:List-Help:List-Unsubscribe:MIME-Version:Message-ID:References:Sender:Subject:Subject:To:To:Message-Id:Reply-To; bh=hyzz5JKkkwDb4eHBd2rl06WDZVc9rBqW/BDYct2UHeM=; 
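
Before the VT-d port in the next patch, here is a minimal usage sketch of the context interface declared above, from the point of view of a hypothetical in-hypervisor caller (assuming the prototypes added to xen/include/xen/iommu.h in this series; example_use_context and the flag choices are illustrative, and error unwinding is trimmed):

    /* Hypothetical consumer of the per-domain context API (sketch only). */
    static int example_use_context(struct domain *d, device_t *dev,
                                   dfn_t dfn, mfn_t mfn)
    {
        u16 ctx_no;
        unsigned int flush_flags = 0;
        int rc;

        /* Create a fresh context for this domain (0 is the default one). */
        rc = iommu_context_alloc(d, &ctx_no, 0);
        if ( rc )
            return rc;

        /* Move the device from its current context into the new one. */
        rc = iommu_reattach_context(d, d, dev, ctx_no);
        if ( rc )
            goto fail;

        /* Mappings are now per-context: the trailing ctx_no argument selects
         * which set of page tables receives the mapping. */
        rc = iommu_map(d, dfn, mfn, 1, IOMMUF_readable | IOMMUF_writable,
                       &flush_flags, ctx_no);
        if ( rc )
            goto fail;

        return iommu_iotlb_flush(d, dfn, 1, flush_flags, ctx_no);

     fail:
        /* IOMMU_TEARDOWN_REATTACH_DEFAULT pushes any device still attached
         * back to the default context before the context is released. */
        iommu_context_free(d, ctx_no, IOMMU_TEARDOWN_REATTACH_DEFAULT);
        return rc;
    }
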
NVawvakillwA52cPxvH00xQw/ev9PCER+KXvNrG04jJUT/sDPgsQzpZNdDptUMwGlX xJa3g7foYlDvAbTxKW914gcFWXe44vUv6ELCWnTnK7MwRZZSwDRtvrndJIeDG/QyPr O5HSFIteIqBkA== From: Teddy Astie Subject: =?utf-8?Q?[RFC=20XEN=20PATCH=204/5]=20VT-d:=20Port=20IOMMU=20driver=20to=20new=20subsystem?= X-Mailer: git-send-email 2.45.2 X-Bm-Disclaimer: Yes X-Bm-Milter-Handled: 4ffbd6c1-ee69-4e1b-aabd-f977039bd3e2 X-Bm-Transport-Timestamp: 1718281004942 To: xen-devel@lists.xenproject.org Cc: Teddy Astie , Jan Beulich , Andrew Cooper , =?utf-8?Q?Roger=20Pau=20Monn=C3=A9?= , =?utf-8?Q?Marek=20Marczykowski-G=C3=B3recki?= Message-Id: In-Reply-To: References: X-Native-Encoded: 1 X-Report-Abuse: =?UTF-8?Q?Please=20forward=20a=20copy=20of=20this=20message,=20including=20all=20headers,=20to=20abuse@mandrill.com.=20You=20can=20also=20report=20abuse=20here:=20https://mandrillapp.com/contact/abuse=3Fid=3D30504962.f0d5b8db8c4c4466a232666919dd083a?= X-Mandrill-User: md_30504962 Feedback-ID: 30504962:30504962.20240613:md Date: Thu, 13 Jun 2024 12:16:47 +0000 MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ZohoMail-DKIM: pass (identity teddy.astie@vates.tech) (identity @mandrillapp.com) X-ZM-MESSAGEID: 1718281039454100001 Content-Type: text/plain; charset="utf-8" Port the driver with guidances specified in iommu-contexts.md. Add a arena-based allocator for allocating a fixed chunk of memory and split it into 4k pages for use by the IOMMU contexts. This chunk size is configurable with X86_ARENA_ORDER and dom0-iommu=3Darena-order=3DN. Signed-off-by Teddy Astie --- Missing in this RFC Preventing guest from mapping on top of reserved regions. Reattach RMRR failure cleanup code is incomplete and can cause issues with a subsequent teardown operation. --- xen/arch/x86/include/asm/arena.h | 54 + xen/arch/x86/include/asm/iommu.h | 44 +- xen/arch/x86/include/asm/pci.h | 17 - xen/drivers/passthrough/Kconfig | 14 + xen/drivers/passthrough/vtd/Makefile | 2 +- xen/drivers/passthrough/vtd/extern.h | 14 +- xen/drivers/passthrough/vtd/iommu.c | 1555 +++++++++++--------------- xen/drivers/passthrough/vtd/quirks.c | 21 +- xen/drivers/passthrough/x86/Makefile | 1 + xen/drivers/passthrough/x86/arena.c | 157 +++ xen/drivers/passthrough/x86/iommu.c | 104 +- 11 files changed, 980 insertions(+), 1003 deletions(-) create mode 100644 xen/arch/x86/include/asm/arena.h create mode 100644 xen/drivers/passthrough/x86/arena.c diff --git a/xen/arch/x86/include/asm/arena.h b/xen/arch/x86/include/asm/ar= ena.h new file mode 100644 index 0000000000..7555b100e0 --- /dev/null +++ b/xen/arch/x86/include/asm/arena.h @@ -0,0 +1,54 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/** + * Simple arena-based page allocator. + */ + +#ifndef __XEN_IOMMU_ARENA_H__ +#define __XEN_IOMMU_ARENA_H__ + +#include "xen/domain.h" +#include "xen/atomic.h" +#include "xen/mm-frame.h" +#include "xen/types.h" + +/** + * struct page_arena: Page arena structure + */ +struct iommu_arena { + /* mfn of the first page of the memory region */ + mfn_t region_start; + /* bitmap of allocations */ + unsigned long *map; + + /* Order of the arena */ + unsigned int order; + + /* Used page count */ + atomic_t used_pages; +}; + +/** + * Initialize a arena using domheap allocator. 
+ * @param [out] arena Arena to allocate + * @param [in] domain domain that has ownership of arena pages + * @param [in] order order of the arena (power of two of the size) + * @param [in] memflags Flags for domheap_alloc_pages() + * @return -ENOMEM on arena allocation error, 0 otherwise + */ +int iommu_arena_initialize(struct iommu_arena *arena, struct domain *domai= n, + unsigned int order, unsigned int memflags); + +/** + * Teardown a arena. + * @param [out] arena arena to allocate + * @param [in] check check for existing allocations + * @return -EBUSY if check is specified + */ +int iommu_arena_teardown(struct iommu_arena *arena, bool check); + +struct page_info *iommu_arena_allocate_page(struct iommu_arena *arena); +bool iommu_arena_free_page(struct iommu_arena *arena, struct page_info *pa= ge); + +#define iommu_arena_size(arena) (1LLU << (arena)->order) + +#endif diff --git a/xen/arch/x86/include/asm/iommu.h b/xen/arch/x86/include/asm/io= mmu.h index 8dc464fbd3..8fb402f1ee 100644 --- a/xen/arch/x86/include/asm/iommu.h +++ b/xen/arch/x86/include/asm/iommu.h @@ -2,14 +2,18 @@ #ifndef __ARCH_X86_IOMMU_H__ #define __ARCH_X86_IOMMU_H__ =20 +#include #include #include #include #include +#include #include #include #include =20 +#include "arena.h" + #define DEFAULT_DOMAIN_ADDRESS_WIDTH 48 =20 struct g2m_ioport { @@ -31,27 +35,48 @@ typedef uint64_t daddr_t; #define dfn_to_daddr(dfn) __dfn_to_daddr(dfn_x(dfn)) #define daddr_to_dfn(daddr) _dfn(__daddr_to_dfn(daddr)) =20 -struct arch_iommu +struct arch_iommu_context { - spinlock_t mapping_lock; /* io page table lock */ struct { struct page_list_head list; spinlock_t lock; } pgtables; =20 - struct list_head identity_maps; + /* Queue for freeing pages */ + struct page_list_head free_queue; =20 union { /* Intel VT-d */ struct { uint64_t pgd_maddr; /* io page directory machine address */ + domid_t *didmap; /* per-iommu DID */ + unsigned long *iommu_bitmap; /* bitmap of iommu(s) that the co= ntext uses */ + bool duplicated_rmrr; /* tag indicating that duplicated rmrr m= appings are mapped */ + uint32_t superpage_progress; /* superpage progress during tear= down */ + } vtd; + /* AMD IOMMU */ + struct { + struct page_info *root_table; + } amd; + }; +}; + +struct arch_iommu +{ + spinlock_t lock; /* io page table lock */ + struct list_head identity_maps; + + struct iommu_arena pt_arena; /* allocator for non-default contexts */ + + union { + /* Intel VT-d */ + struct { unsigned int agaw; /* adjusted guest address width, 0 is level= 2 30-bit */ - unsigned long *iommu_bitmap; /* bitmap of iommu(s) that the do= main uses */ } vtd; /* AMD IOMMU */ struct { unsigned int paging_mode; - struct page_info *root_table; + struct guest_iommu *g_iommu; } amd; }; }; @@ -128,14 +153,19 @@ unsigned long *iommu_init_domid(domid_t reserve); domid_t iommu_alloc_domid(unsigned long *map); void iommu_free_domid(domid_t domid, unsigned long *map); =20 -int __must_check iommu_free_pgtables(struct domain *d); +struct iommu_context; +int __must_check iommu_free_pgtables(struct domain *d, struct iommu_contex= t *ctx); struct domain_iommu; struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd, + struct iommu_context *c= tx, uint64_t contig_mask); -void iommu_queue_free_pgtable(struct domain_iommu *hd, struct page_info *p= g); +void iommu_queue_free_pgtable(struct iommu_context *ctx, struct page_info = *pg); =20 /* Check [start, end] unity map range for correctness. 
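
The arena implementation itself (xen/drivers/passthrough/x86/arena.c in the diffstat above) is not quoted in this hunk; per the commit message and the declarations above, it carves a fixed 2^order chunk into single 4 KiB pages handed out through a bitmap, with used_pages tracking consumption. A stand-alone model of that allocation policy (plain C, illustrative only, not the actual arena.c):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define PAGE_SIZE 4096u

    /* Illustrative model of struct iommu_arena: a 2^order page region plus
     * a bitmap marking which pages are currently handed out. */
    struct arena_model {
        unsigned char *region;   /* stands in for the domheap chunk */
        uint64_t *map;           /* allocation bitmap, one bit per page */
        unsigned int order;      /* arena holds 1 << order pages */
        unsigned int used_pages;
    };

    static int arena_init(struct arena_model *a, unsigned int order)
    {
        size_t pages = 1ul << order;

        a->order = order;
        a->used_pages = 0;
        a->region = calloc(pages, PAGE_SIZE);
        a->map = calloc((pages + 63) / 64, sizeof(*a->map));
        return (a->region && a->map) ? 0 : -1;   /* -ENOMEM in the real code */
    }

    static void *arena_alloc_page(struct arena_model *a)
    {
        size_t i, pages = 1ul << a->order;

        for ( i = 0; i < pages; i++ )
            if ( !(a->map[i / 64] & (1ull << (i % 64))) )
            {
                a->map[i / 64] |= 1ull << (i % 64);
                a->used_pages++;
                return a->region + i * PAGE_SIZE;
            }

        return NULL;                             /* arena exhausted */
    }

    static int arena_free_page(struct arena_model *a, void *page)
    {
        size_t i = ((unsigned char *)page - a->region) / PAGE_SIZE;

        if ( i >= (1ul << a->order) || !(a->map[i / 64] & (1ull << (i % 64))) )
            return -1;                           /* not an allocated arena page */

        a->map[i / 64] &= ~(1ull << (i % 64));
        a->used_pages--;
        return 0;
    }

    int main(void)
    {
        struct arena_model a;
        void *p;

        if ( arena_init(&a, 9) )                 /* order 9: 512 pages, 2 MB */
            return 1;

        p = arena_alloc_page(&a);
        printf("used pages: %u\n", a.used_pages); /* prints 1 */
        arena_free_page(&a, p);
        free(a.region);
        free(a.map);
        return 0;
    }
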
*/ bool iommu_unity_region_ok(const char *prefix, mfn_t start, mfn_t end); +int arch_iommu_context_init(struct domain *d, struct iommu_context *ctx, u= 32 flags); +int arch_iommu_context_teardown(struct domain *d, struct iommu_context *ct= x, u32 flags); +int arch_iommu_flush_free_queue(struct domain *d, struct iommu_context *ct= x); =20 #endif /* !__ARCH_X86_IOMMU_H__ */ /* diff --git a/xen/arch/x86/include/asm/pci.h b/xen/arch/x86/include/asm/pci.h index fd5480d67d..214c1a0948 100644 --- a/xen/arch/x86/include/asm/pci.h +++ b/xen/arch/x86/include/asm/pci.h @@ -15,23 +15,6 @@ =20 struct arch_pci_dev { vmask_t used_vectors; - /* - * These fields are (de)initialized under pcidevs-lock. Other uses of - * them don't race (de)initialization and hence don't strictly need any - * locking. - */ - union { - /* Subset of struct arch_iommu's fields, to be used in dom_io. */ - struct { - uint64_t pgd_maddr; - } vtd; - struct { - struct page_info *root_table; - } amd; - }; - domid_t pseudo_domid; - mfn_t leaf_mfn; - struct page_list_head pgtables_list; }; =20 int pci_conf_write_intercept(unsigned int seg, unsigned int bdf, diff --git a/xen/drivers/passthrough/Kconfig b/xen/drivers/passthrough/Kcon= fig index 78edd80536..1b9f4c8b9c 100644 --- a/xen/drivers/passthrough/Kconfig +++ b/xen/drivers/passthrough/Kconfig @@ -91,3 +91,17 @@ choice config IOMMU_QUARANTINE_SCRATCH_PAGE bool "scratch page" endchoice + +config X86_ARENA_ORDER + int "IOMMU arena order" if EXPERT + depends on X86 + default 9 + help + Specifies the default size of the Dom0 IOMMU arena allocator. + Use 2^order pages for arena. If your system has lots of PCI devices or = if you + encounter IOMMU errors in Dom0, try increasing this value. + This value can be overriden with command-line dom0-iommu=3Darena-order= =3DN. 
+ + [7] 128 pages, 512 KB arena + [9] 512 pages, 2 MB arena (default) + [11] 2048 pages, 8 MB arena \ No newline at end of file diff --git a/xen/drivers/passthrough/vtd/Makefile b/xen/drivers/passthrough= /vtd/Makefile index fde7555fac..81e1f46179 100644 --- a/xen/drivers/passthrough/vtd/Makefile +++ b/xen/drivers/passthrough/vtd/Makefile @@ -5,4 +5,4 @@ obj-y +=3D dmar.o obj-y +=3D utils.o obj-y +=3D qinval.o obj-y +=3D intremap.o -obj-y +=3D quirks.o +obj-y +=3D quirks.o \ No newline at end of file diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough= /vtd/extern.h index 667590ee52..69f808a44a 100644 --- a/xen/drivers/passthrough/vtd/extern.h +++ b/xen/drivers/passthrough/vtd/extern.h @@ -80,12 +80,10 @@ uint64_t alloc_pgtable_maddr(unsigned long npages, node= id_t node); void free_pgtable_maddr(u64 maddr); void *map_vtd_domain_page(u64 maddr); void unmap_vtd_domain_page(const void *va); -int domain_context_mapping_one(struct domain *domain, struct vtd_iommu *io= mmu, - uint8_t bus, uint8_t devfn, - const struct pci_dev *pdev, domid_t domid, - paddr_t pgd_maddr, unsigned int mode); -int domain_context_unmap_one(struct domain *domain, struct vtd_iommu *iomm= u, - uint8_t bus, uint8_t devfn); +int apply_context_single(struct domain *domain, struct iommu_context *ctx, + struct vtd_iommu *iommu, uint8_t bus, uint8_t dev= fn); +int unapply_context_single(struct domain *domain, struct iommu_context *ct= x, + struct vtd_iommu *iommu, uint8_t bus, uint8_t d= evfn); int cf_check intel_iommu_get_reserved_device_memory( iommu_grdm_t *func, void *ctxt); =20 @@ -106,8 +104,8 @@ void platform_quirks_init(void); void vtd_ops_preamble_quirk(struct vtd_iommu *iommu); void vtd_ops_postamble_quirk(struct vtd_iommu *iommu); int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus, - uint8_t devfn, domid_t domid, paddr_t pgd_m= addr, - unsigned int mode); + uint8_t devfn, domid_t domid, + unsigned int mode, struct iommu_context *ct= x); void pci_vtd_quirk(const struct pci_dev *); void quirk_iommu_caps(struct vtd_iommu *iommu); =20 diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/= vtd/iommu.c index e13be244c1..068aeed876 100644 --- a/xen/drivers/passthrough/vtd/iommu.c +++ b/xen/drivers/passthrough/vtd/iommu.c @@ -30,12 +30,21 @@ #include #include #include +#include +#include #include +#include +#include +#include +#include #include #include #include #include #include +#include +#include +#include #include #include "iommu.h" #include "dmar.h" @@ -46,14 +55,6 @@ #define CONTIG_MASK DMA_PTE_CONTIG_MASK #include =20 -/* dom_io is used as a sentinel for quarantined devices */ -#define QUARANTINE_SKIP(d, pgd_maddr) ((d) =3D=3D dom_io && !(pgd_maddr)) -#define DEVICE_DOMID(d, pdev) ((d) !=3D dom_io ? (d)->domain_id \ - : (pdev)->arch.pseudo_domid) -#define DEVICE_PGTABLE(d, pdev) ((d) !=3D dom_io \ - ? dom_iommu(d)->arch.vtd.pgd_maddr \ - : (pdev)->arch.vtd.pgd_maddr) - bool __read_mostly iommu_igfx =3D true; bool __read_mostly iommu_qinval =3D true; #ifndef iommu_snoop @@ -206,26 +207,14 @@ static bool any_pdev_behind_iommu(const struct domain= *d, * clear iommu in iommu_bitmap and clear domain_id in domid_bitmap. 
*/ static void check_cleanup_domid_map(const struct domain *d, + const struct iommu_context *ctx, const struct pci_dev *exclude, struct vtd_iommu *iommu) { - bool found; - - if ( d =3D=3D dom_io ) - return; - - found =3D any_pdev_behind_iommu(d, exclude, iommu); - /* - * Hidden devices are associated with DomXEN but usable by the hardware - * domain. Hence they need considering here as well. - */ - if ( !found && is_hardware_domain(d) ) - found =3D any_pdev_behind_iommu(dom_xen, exclude, iommu); - - if ( !found ) + if ( !any_pdev_behind_iommu(d, exclude, iommu) ) { - clear_bit(iommu->index, dom_iommu(d)->arch.vtd.iommu_bitmap); - cleanup_domid_map(d->domain_id, iommu); + clear_bit(iommu->index, ctx->arch.vtd.iommu_bitmap); + cleanup_domid_map(ctx->arch.vtd.didmap[iommu->index], iommu); } } =20 @@ -312,8 +301,9 @@ static u64 bus_to_context_maddr(struct vtd_iommu *iommu= , u8 bus) * PTE for the requested address, * - for target =3D=3D 0 the full PTE contents below PADDR_BITS limit. */ -static uint64_t addr_to_dma_page_maddr(struct domain *domain, daddr_t addr, - unsigned int target, +static uint64_t addr_to_dma_page_maddr(struct domain *domain, + struct iommu_context *ctx, + daddr_t addr, unsigned int target, unsigned int *flush_flags, bool all= oc) { struct domain_iommu *hd =3D dom_iommu(domain); @@ -323,10 +313,9 @@ static uint64_t addr_to_dma_page_maddr(struct domain *= domain, daddr_t addr, u64 pte_maddr =3D 0; =20 addr &=3D (((u64)1) << addr_width) - 1; - ASSERT(spin_is_locked(&hd->arch.mapping_lock)); ASSERT(target || !alloc); =20 - if ( !hd->arch.vtd.pgd_maddr ) + if ( !ctx->arch.vtd.pgd_maddr ) { struct page_info *pg; =20 @@ -334,13 +323,13 @@ static uint64_t addr_to_dma_page_maddr(struct domain = *domain, daddr_t addr, goto out; =20 pte_maddr =3D level; - if ( !(pg =3D iommu_alloc_pgtable(hd, 0)) ) + if ( !(pg =3D iommu_alloc_pgtable(hd, ctx, 0)) ) goto out; =20 - hd->arch.vtd.pgd_maddr =3D page_to_maddr(pg); + ctx->arch.vtd.pgd_maddr =3D page_to_maddr(pg); } =20 - pte_maddr =3D hd->arch.vtd.pgd_maddr; + pte_maddr =3D ctx->arch.vtd.pgd_maddr; parent =3D map_vtd_domain_page(pte_maddr); while ( level > target ) { @@ -376,7 +365,7 @@ static uint64_t addr_to_dma_page_maddr(struct domain *d= omain, daddr_t addr, } =20 pte_maddr =3D level - 1; - pg =3D iommu_alloc_pgtable(hd, DMA_PTE_CONTIG_MASK); + pg =3D iommu_alloc_pgtable(hd, ctx, DMA_PTE_CONTIG_MASK); if ( !pg ) break; =20 @@ -428,38 +417,25 @@ static uint64_t addr_to_dma_page_maddr(struct domain = *domain, daddr_t addr, return pte_maddr; } =20 -static paddr_t domain_pgd_maddr(struct domain *d, paddr_t pgd_maddr, - unsigned int nr_pt_levels) +static paddr_t get_context_pgd(struct domain *d, struct iommu_context *ctx, + unsigned int nr_pt_levels) { - struct domain_iommu *hd =3D dom_iommu(d); unsigned int agaw; + paddr_t pgd_maddr =3D ctx->arch.vtd.pgd_maddr; =20 - ASSERT(spin_is_locked(&hd->arch.mapping_lock)); - - if ( pgd_maddr ) - /* nothing */; - else if ( iommu_use_hap_pt(d) ) + if ( !ctx->arch.vtd.pgd_maddr ) { - pagetable_t pgt =3D p2m_get_pagetable(p2m_get_hostp2m(d)); + /* + * Ensure we have pagetables allocated down to the smallest + * level the loop below may need to run to. + */ + addr_to_dma_page_maddr(d, ctx, 0, min_pt_levels, NULL, true); =20 - pgd_maddr =3D pagetable_get_paddr(pgt); + if ( !ctx->arch.vtd.pgd_maddr ) + return 0; } - else - { - if ( !hd->arch.vtd.pgd_maddr ) - { - /* - * Ensure we have pagetables allocated down to the smallest - * level the loop below may need to run to. 
- */ - addr_to_dma_page_maddr(d, 0, min_pt_levels, NULL, true); - - if ( !hd->arch.vtd.pgd_maddr ) - return 0; - } =20 - pgd_maddr =3D hd->arch.vtd.pgd_maddr; - } + pgd_maddr =3D ctx->arch.vtd.pgd_maddr; =20 /* Skip top level(s) of page tables for less-than-maximum level DRHDs.= */ for ( agaw =3D level_to_agaw(4); @@ -727,28 +703,18 @@ static int __must_check iommu_flush_all(void) return rc; } =20 -static int __must_check cf_check iommu_flush_iotlb(struct domain *d, dfn_t= dfn, +static int __must_check cf_check iommu_flush_iotlb(struct domain *d, + struct iommu_context *c= tx, + dfn_t dfn, unsigned long page_coun= t, unsigned int flush_flag= s) { - struct domain_iommu *hd =3D dom_iommu(d); struct acpi_drhd_unit *drhd; struct vtd_iommu *iommu; bool flush_dev_iotlb; int iommu_domid; int ret =3D 0; =20 - if ( flush_flags & IOMMU_FLUSHF_all ) - { - dfn =3D INVALID_DFN; - page_count =3D 0; - } - else - { - ASSERT(page_count && !dfn_eq(dfn, INVALID_DFN)); - ASSERT(flush_flags); - } - /* * No need pcideves_lock here because we have flush * when assign/deassign device @@ -759,13 +725,20 @@ static int __must_check cf_check iommu_flush_iotlb(st= ruct domain *d, dfn_t dfn, =20 iommu =3D drhd->iommu; =20 - if ( !test_bit(iommu->index, hd->arch.vtd.iommu_bitmap) ) - continue; + if ( ctx ) + { + if ( !test_bit(iommu->index, ctx->arch.vtd.iommu_bitmap) ) + continue; + + iommu_domid =3D get_iommu_did(ctx->arch.vtd.didmap[iommu->inde= x], iommu, true); + + if ( iommu_domid =3D=3D -1 ) + continue; + } + else + iommu_domid =3D 0; =20 flush_dev_iotlb =3D !!find_ats_dev_drhd(iommu); - iommu_domid =3D get_iommu_did(d->domain_id, iommu, !d->is_dying); - if ( iommu_domid =3D=3D -1 ) - continue; =20 if ( !page_count || (page_count & (page_count - 1)) || dfn_eq(dfn, INVALID_DFN) || !IS_ALIGNED(dfn_x(dfn), page_coun= t) ) @@ -784,10 +757,13 @@ static int __must_check cf_check iommu_flush_iotlb(st= ruct domain *d, dfn_t dfn, ret =3D rc; } =20 + if ( !ret && ctx ) + arch_iommu_flush_free_queue(d, ctx); + return ret; } =20 -static void queue_free_pt(struct domain_iommu *hd, mfn_t mfn, unsigned int= level) +static void queue_free_pt(struct iommu_context *ctx, mfn_t mfn, unsigned i= nt level) { if ( level > 1 ) { @@ -796,13 +772,13 @@ static void queue_free_pt(struct domain_iommu *hd, mf= n_t mfn, unsigned int level =20 for ( i =3D 0; i < PTE_NUM; ++i ) if ( dma_pte_present(pt[i]) && !dma_pte_superpage(pt[i]) ) - queue_free_pt(hd, maddr_to_mfn(dma_pte_addr(pt[i])), + queue_free_pt(ctx, maddr_to_mfn(dma_pte_addr(pt[i])), level - 1); =20 unmap_domain_page(pt); } =20 - iommu_queue_free_pgtable(hd, mfn_to_page(mfn)); + iommu_queue_free_pgtable(ctx, mfn_to_page(mfn)); } =20 static int iommu_set_root_entry(struct vtd_iommu *iommu) @@ -1433,11 +1409,6 @@ static int cf_check intel_iommu_domain_init(struct d= omain *d) { struct domain_iommu *hd =3D dom_iommu(d); =20 - hd->arch.vtd.iommu_bitmap =3D xzalloc_array(unsigned long, - BITS_TO_LONGS(nr_iommus)); - if ( !hd->arch.vtd.iommu_bitmap ) - return -ENOMEM; - hd->arch.vtd.agaw =3D width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH); =20 return 0; @@ -1465,32 +1436,22 @@ static void __hwdom_init cf_check intel_iommu_hwdom= _init(struct domain *d) } } =20 -/* - * This function returns - * - a negative errno value upon error, - * - zero upon success when previously the entry was non-present, or this = isn't - * the "main" request for a device (pdev =3D=3D NULL), or for no-op quar= antining - * assignments, - * - positive (one) upon success when previously the entry was present and= this - * 
is the "main" request for a device (pdev !=3D NULL). +/** + * Apply a context on a device. + * @param domain Domain of the context + * @param iommu IOMMU hardware to use (must match device iommu) + * @param ctx IOMMU context to apply + * @param devfn PCI device function (may be different to pdev) */ -int domain_context_mapping_one( - struct domain *domain, - struct vtd_iommu *iommu, - uint8_t bus, uint8_t devfn, const struct pci_dev *pdev, - domid_t domid, paddr_t pgd_maddr, unsigned int mode) +int apply_context_single(struct domain *domain, struct iommu_context *ctx, + struct vtd_iommu *iommu, uint8_t bus, uint8_t dev= fn) { - struct domain_iommu *hd =3D dom_iommu(domain); struct context_entry *context, *context_entries, lctxt; - __uint128_t old; + __uint128_t res, old; uint64_t maddr; - uint16_t seg =3D iommu->drhd->segment, prev_did =3D 0; - struct domain *prev_dom =3D NULL; + uint16_t seg =3D iommu->drhd->segment, prev_did =3D 0, did; int rc, ret; - bool flush_dev_iotlb; - - if ( QUARANTINE_SKIP(domain, pgd_maddr) ) - return 0; + bool flush_dev_iotlb, overwrite_entry =3D false; =20 ASSERT(pcidevs_locked()); spin_lock(&iommu->lock); @@ -1499,28 +1460,15 @@ int domain_context_mapping_one( context =3D &context_entries[devfn]; old =3D (lctxt =3D *context).full; =20 - if ( context_present(lctxt) ) - { - domid_t domid; + did =3D ctx->arch.vtd.didmap[iommu->index]; =20 + if ( context_present(*context) ) + { prev_did =3D context_domain_id(lctxt); - domid =3D did_to_domain_id(iommu, prev_did); - if ( domid < DOMID_FIRST_RESERVED ) - prev_dom =3D rcu_lock_domain_by_id(domid); - else if ( pdev ? domid =3D=3D pdev->arch.pseudo_domid : domid > DO= MID_MASK ) - prev_dom =3D rcu_lock_domain(dom_io); - if ( !prev_dom ) - { - spin_unlock(&iommu->lock); - unmap_vtd_domain_page(context_entries); - dprintk(XENLOG_DEBUG VTDPREFIX, - "no domain for did %u (nr_dom %u)\n", - prev_did, cap_ndoms(iommu->cap)); - return -ESRCH; - } + overwrite_entry =3D true; } =20 - if ( iommu_hwdom_passthrough && is_hardware_domain(domain) ) + if ( iommu_hwdom_passthrough && is_hardware_domain(domain) && !ctx->id= ) { context_set_translation_type(lctxt, CONTEXT_TT_PASS_THRU); } @@ -1528,16 +1476,10 @@ int domain_context_mapping_one( { paddr_t root; =20 - spin_lock(&hd->arch.mapping_lock); - - root =3D domain_pgd_maddr(domain, pgd_maddr, iommu->nr_pt_levels); + root =3D get_context_pgd(domain, ctx, iommu->nr_pt_levels); if ( !root ) { - spin_unlock(&hd->arch.mapping_lock); - spin_unlock(&iommu->lock); unmap_vtd_domain_page(context_entries); - if ( prev_dom ) - rcu_unlock_domain(prev_dom); return -ENOMEM; } =20 @@ -1546,98 +1488,39 @@ int domain_context_mapping_one( context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB); else context_set_translation_type(lctxt, CONTEXT_TT_MULTI_LEVEL); - - spin_unlock(&hd->arch.mapping_lock); } =20 - rc =3D context_set_domain_id(&lctxt, domid, iommu); + rc =3D context_set_domain_id(&lctxt, did, iommu); if ( rc ) - { - unlock: - spin_unlock(&iommu->lock); - unmap_vtd_domain_page(context_entries); - if ( prev_dom ) - rcu_unlock_domain(prev_dom); - return rc; - } - - if ( !prev_dom ) - { - context_set_address_width(lctxt, level_to_agaw(iommu->nr_pt_levels= )); - context_set_fault_enable(lctxt); - context_set_present(lctxt); - } - else if ( prev_dom =3D=3D domain ) - { - ASSERT(lctxt.full =3D=3D context->full); - rc =3D !!pdev; goto unlock; - } - else - { - ASSERT(context_address_width(lctxt) =3D=3D - level_to_agaw(iommu->nr_pt_levels)); - ASSERT(!context_fault_disable(lctxt)); - } - - if ( 
cpu_has_cx16 ) - { - __uint128_t res =3D cmpxchg16b(context, &old, &lctxt.full); =20 - /* - * Hardware does not update the context entry behind our backs, - * so the return value should match "old". - */ - if ( res !=3D old ) - { - if ( pdev ) - check_cleanup_domid_map(domain, pdev, iommu); - printk(XENLOG_ERR - "%pp: unexpected context entry %016lx_%016lx (expected = %016lx_%016lx)\n", - &PCI_SBDF(seg, bus, devfn), - (uint64_t)(res >> 64), (uint64_t)res, - (uint64_t)(old >> 64), (uint64_t)old); - rc =3D -EILSEQ; - goto unlock; - } - } - else if ( !prev_dom || !(mode & MAP_WITH_RMRR) ) - { - context_clear_present(*context); - iommu_sync_cache(context, sizeof(*context)); + context_set_address_width(lctxt, level_to_agaw(iommu->nr_pt_levels)); + context_set_fault_enable(lctxt); + context_set_present(lctxt); =20 - write_atomic(&context->hi, lctxt.hi); - /* No barrier should be needed between these two. */ - write_atomic(&context->lo, lctxt.lo); - } - else /* Best effort, updating DID last. */ - { - /* - * By non-atomically updating the context entry's DID field last, - * during a short window in time TLB entries with the old domain = ID - * but the new page tables may be inserted. This could affect I/O - * of other devices using this same (old) domain ID. Such updati= ng - * therefore is not a problem if this was the only device associa= ted - * with the old domain ID. Diverting I/O of any of a dying domai= n's - * devices to the quarantine page tables is intended anyway. - */ - if ( !(mode & (MAP_OWNER_DYING | MAP_SINGLE_DEVICE)) ) - printk(XENLOG_WARNING VTDPREFIX - " %pp: reassignment may cause %pd data corruption\n", - &PCI_SBDF(seg, bus, devfn), prev_dom); + res =3D cmpxchg16b(context, &old, &lctxt.full); =20 - write_atomic(&context->lo, lctxt.lo); - /* No barrier should be needed between these two. */ - write_atomic(&context->hi, lctxt.hi); + /* + * Hardware does not update the context entry behind our backs, + * so the return value should match "old". + */ + if ( res !=3D old ) + { + printk(XENLOG_ERR + "%pp: unexpected context entry %016lx_%016lx (expected %01= 6lx_%016lx)\n", + &PCI_SBDF(seg, bus, devfn), + (uint64_t)(res >> 64), (uint64_t)res, + (uint64_t)(old >> 64), (uint64_t)old); + rc =3D -EILSEQ; + goto unlock; } =20 iommu_sync_cache(context, sizeof(struct context_entry)); - spin_unlock(&iommu->lock); =20 rc =3D iommu_flush_context_device(iommu, prev_did, PCI_BDF(bus, devfn), - DMA_CCMD_MASK_NOBIT, !prev_dom); + DMA_CCMD_MASK_NOBIT, !overwrite_entry); flush_dev_iotlb =3D !!find_ats_dev_drhd(iommu); - ret =3D iommu_flush_iotlb_dsi(iommu, prev_did, !prev_dom, flush_dev_io= tlb); + ret =3D iommu_flush_iotlb_dsi(iommu, prev_did, !overwrite_entry, flush= _dev_iotlb); =20 /* * The current logic for returns: @@ -1653,230 +1536,55 @@ int domain_context_mapping_one( if ( rc > 0 ) rc =3D 0; =20 - set_bit(iommu->index, hd->arch.vtd.iommu_bitmap); + set_bit(iommu->index, ctx->arch.vtd.iommu_bitmap); =20 unmap_vtd_domain_page(context_entries); + spin_unlock(&iommu->lock); =20 if ( !seg && !rc ) - rc =3D me_wifi_quirk(domain, bus, devfn, domid, pgd_maddr, mode); - - if ( rc && !(mode & MAP_ERROR_RECOVERY) ) - { - if ( !prev_dom || - /* - * Unmapping here means DEV_TYPE_PCI devices with RMRRs (if s= uch - * exist) would cause problems if such a region was actually - * accessed. 
- */ - (prev_dom =3D=3D dom_io && !pdev) ) - ret =3D domain_context_unmap_one(domain, iommu, bus, devfn); - else - ret =3D domain_context_mapping_one(prev_dom, iommu, bus, devfn= , pdev, - DEVICE_DOMID(prev_dom, pdev), - DEVICE_PGTABLE(prev_dom, pdev= ), - (mode & MAP_WITH_RMRR) | - MAP_ERROR_RECOVERY) < 0; - - if ( !ret && pdev && pdev->devfn =3D=3D devfn ) - check_cleanup_domid_map(domain, pdev, iommu); - } + rc =3D me_wifi_quirk(domain, bus, devfn, did, 0, ctx); =20 - if ( prev_dom ) - rcu_unlock_domain(prev_dom); + return rc; =20 - return rc ?: pdev && prev_dom; + unlock: + unmap_vtd_domain_page(context_entries); + spin_unlock(&iommu->lock); + return rc; } =20 -static const struct acpi_drhd_unit *domain_context_unmap( - struct domain *d, uint8_t devfn, struct pci_dev *pdev); - -static int domain_context_mapping(struct domain *domain, u8 devfn, - struct pci_dev *pdev) +int apply_context(struct domain *d, struct iommu_context *ctx, + struct pci_dev *pdev, u8 devfn) { const struct acpi_drhd_unit *drhd =3D acpi_find_matched_drhd_unit(pdev= ); - const struct acpi_rmrr_unit *rmrr; - paddr_t pgd_maddr =3D DEVICE_PGTABLE(domain, pdev); - domid_t orig_domid =3D pdev->arch.pseudo_domid; int ret =3D 0; - unsigned int i, mode =3D 0; - uint16_t seg =3D pdev->seg, bdf; - uint8_t bus =3D pdev->bus, secbus; - - /* - * Generally we assume only devices from one node to get assigned to a - * given guest. But even if not, by replacing the prior value here we - * guarantee that at least some basic allocations for the device being - * added will get done against its node. Any further allocations for - * this or other devices may be penalized then, but some would also be - * if we left other than NUMA_NO_NODE untouched here. - */ - if ( drhd && drhd->iommu->node !=3D NUMA_NO_NODE ) - dom_iommu(domain)->node =3D drhd->iommu->node; - - ASSERT(pcidevs_locked()); - - for_each_rmrr_device( rmrr, bdf, i ) - { - if ( rmrr->segment !=3D pdev->seg || bdf !=3D pdev->sbdf.bdf ) - continue; =20 - mode |=3D MAP_WITH_RMRR; - break; - } + if ( !drhd ) + return -EINVAL; =20 - if ( domain !=3D pdev->domain && pdev->domain !=3D dom_io ) + if ( pdev->type =3D=3D DEV_TYPE_PCI_HOST_BRIDGE || + pdev->type =3D=3D DEV_TYPE_PCIe_BRIDGE || + pdev->type =3D=3D DEV_TYPE_PCIe2PCI_BRIDGE || + pdev->type =3D=3D DEV_TYPE_LEGACY_PCI_BRIDGE ) { - if ( pdev->domain->is_dying ) - mode |=3D MAP_OWNER_DYING; - else if ( drhd && - !any_pdev_behind_iommu(pdev->domain, pdev, drhd->iommu) = && - !pdev->phantom_stride ) - mode |=3D MAP_SINGLE_DEVICE; + printk(XENLOG_WARNING VTDPREFIX " Ignoring apply_context on PCI br= idge\n"); + return 0; } =20 - switch ( pdev->type ) - { - bool prev_present; - - case DEV_TYPE_PCI_HOST_BRIDGE: - if ( iommu_debug ) - printk(VTDPREFIX "%pd:Hostbridge: skip %pp map\n", - domain, &PCI_SBDF(seg, bus, devfn)); - if ( !is_hardware_domain(domain) ) - return -EPERM; - break; - - case DEV_TYPE_PCIe_BRIDGE: - case DEV_TYPE_PCIe2PCI_BRIDGE: - case DEV_TYPE_LEGACY_PCI_BRIDGE: - break; - - case DEV_TYPE_PCIe_ENDPOINT: - if ( !drhd ) - return -ENODEV; - - if ( iommu_quarantine && orig_domid =3D=3D DOMID_INVALID ) - { - pdev->arch.pseudo_domid =3D - iommu_alloc_domid(drhd->iommu->pseudo_domid_map); - if ( pdev->arch.pseudo_domid =3D=3D DOMID_INVALID ) - return -ENOSPC; - } - - if ( iommu_debug ) - printk(VTDPREFIX "%pd:PCIe: map %pp\n", - domain, &PCI_SBDF(seg, bus, devfn)); - ret =3D domain_context_mapping_one(domain, drhd->iommu, bus, devfn= , pdev, - DEVICE_DOMID(domain, pdev), pgd_m= addr, - mode); - if ( ret > 0 ) - ret =3D 
0; - if ( !ret && devfn =3D=3D pdev->devfn && ats_device(pdev, drhd) > = 0 ) - enable_ats_device(pdev, &drhd->iommu->ats_devices); - - break; - - case DEV_TYPE_PCI: - if ( !drhd ) - return -ENODEV; - - if ( iommu_quarantine && orig_domid =3D=3D DOMID_INVALID ) - { - pdev->arch.pseudo_domid =3D - iommu_alloc_domid(drhd->iommu->pseudo_domid_map); - if ( pdev->arch.pseudo_domid =3D=3D DOMID_INVALID ) - return -ENOSPC; - } - - if ( iommu_debug ) - printk(VTDPREFIX "%pd:PCI: map %pp\n", - domain, &PCI_SBDF(seg, bus, devfn)); - - ret =3D domain_context_mapping_one(domain, drhd->iommu, bus, devfn, - pdev, DEVICE_DOMID(domain, pdev), - pgd_maddr, mode); - if ( ret < 0 ) - break; - prev_present =3D ret; - - if ( (ret =3D find_upstream_bridge(seg, &bus, &devfn, &secbus)) < = 1 ) - { - if ( !ret ) - break; - ret =3D -ENXIO; - } - /* - * Strictly speaking if the device is the only one behind this bri= dge - * and the only one with this (secbus,0,0) tuple, it could be allo= wed - * to be re-assigned regardless of RMRR presence. But let's deal = with - * that case only if it is actually found in the wild. Note that - * dealing with this just here would still not render the operation - * secure. - */ - else if ( prev_present && (mode & MAP_WITH_RMRR) && - domain !=3D pdev->domain ) - ret =3D -EOPNOTSUPP; - - /* - * Mapping a bridge should, if anything, pass the struct pci_dev of - * that bridge. Since bridges don't normally get assigned to guest= s, - * their owner would be the wrong one. Pass NULL instead. - */ - if ( ret >=3D 0 ) - ret =3D domain_context_mapping_one(domain, drhd->iommu, bus, d= evfn, - NULL, DEVICE_DOMID(domain, pd= ev), - pgd_maddr, mode); - - /* - * Devices behind PCIe-to-PCI/PCIx bridge may generate different - * requester-id. It may originate from devfn=3D0 on the secondary = bus - * behind the bridge. Map that id as well if we didn't already. - * - * Somewhat similar as for bridges, we don't want to pass a struct - * pci_dev here - there may not even exist one for this (secbus,0,= 0) - * tuple. If there is one, without properly working device groups = it - * may again not have the correct owner. - */ - if ( !ret && pdev_type(seg, bus, devfn) =3D=3D DEV_TYPE_PCIe2PCI_B= RIDGE && - (secbus !=3D pdev->bus || pdev->devfn !=3D 0) ) - ret =3D domain_context_mapping_one(domain, drhd->iommu, secbus= , 0, - NULL, DEVICE_DOMID(domain, pd= ev), - pgd_maddr, mode); - - if ( ret ) - { - if ( !prev_present ) - domain_context_unmap(domain, devfn, pdev); - else if ( pdev->domain !=3D domain ) /* Avoid infinite recursi= on. 
*/ - domain_context_mapping(pdev->domain, devfn, pdev); - } + ASSERT(pcidevs_locked()); =20 - break; + ret =3D apply_context_single(d, ctx, drhd->iommu, pdev->bus, devfn); =20 - default: - dprintk(XENLOG_ERR VTDPREFIX, "%pd:unknown(%u): %pp\n", - domain, pdev->type, &PCI_SBDF(seg, bus, devfn)); - ret =3D -EINVAL; - break; - } + if ( !ret && ats_device(pdev, drhd) > 0 ) + enable_ats_device(pdev, &drhd->iommu->ats_devices); =20 if ( !ret && devfn =3D=3D pdev->devfn ) pci_vtd_quirk(pdev); =20 - if ( ret && drhd && orig_domid =3D=3D DOMID_INVALID ) - { - iommu_free_domid(pdev->arch.pseudo_domid, - drhd->iommu->pseudo_domid_map); - pdev->arch.pseudo_domid =3D DOMID_INVALID; - } - return ret; } =20 -int domain_context_unmap_one( - struct domain *domain, - struct vtd_iommu *iommu, - uint8_t bus, uint8_t devfn) +int unapply_context_single(struct domain *domain, struct iommu_context *ct= x, + struct vtd_iommu *iommu, uint8_t bus, uint8_t d= evfn) { struct context_entry *context, *context_entries; u64 maddr; @@ -1928,8 +1636,8 @@ int domain_context_unmap_one( unmap_vtd_domain_page(context_entries); =20 if ( !iommu->drhd->segment && !rc ) - rc =3D me_wifi_quirk(domain, bus, devfn, DOMID_INVALID, 0, - UNMAP_ME_PHANTOM_FUNC); + rc =3D me_wifi_quirk(domain, bus, devfn, DOMID_INVALID, UNMAP_ME_P= HANTOM_FUNC, + NULL); =20 if ( rc && !is_hardware_domain(domain) && domain !=3D dom_io ) { @@ -1947,105 +1655,14 @@ int domain_context_unmap_one( return rc; } =20 -static const struct acpi_drhd_unit *domain_context_unmap( - struct domain *domain, - uint8_t devfn, - struct pci_dev *pdev) -{ - const struct acpi_drhd_unit *drhd =3D acpi_find_matched_drhd_unit(pdev= ); - struct vtd_iommu *iommu =3D drhd ? drhd->iommu : NULL; - int ret; - uint16_t seg =3D pdev->seg; - uint8_t bus =3D pdev->bus, tmp_bus, tmp_devfn, secbus; - - switch ( pdev->type ) - { - case DEV_TYPE_PCI_HOST_BRIDGE: - if ( iommu_debug ) - printk(VTDPREFIX "%pd:Hostbridge: skip %pp unmap\n", - domain, &PCI_SBDF(seg, bus, devfn)); - return ERR_PTR(is_hardware_domain(domain) ? 
0 : -EPERM); - - case DEV_TYPE_PCIe_BRIDGE: - case DEV_TYPE_PCIe2PCI_BRIDGE: - case DEV_TYPE_LEGACY_PCI_BRIDGE: - return ERR_PTR(0); - - case DEV_TYPE_PCIe_ENDPOINT: - if ( !iommu ) - return ERR_PTR(-ENODEV); - - if ( iommu_debug ) - printk(VTDPREFIX "%pd:PCIe: unmap %pp\n", - domain, &PCI_SBDF(seg, bus, devfn)); - ret =3D domain_context_unmap_one(domain, iommu, bus, devfn); - if ( !ret && devfn =3D=3D pdev->devfn && ats_device(pdev, drhd) > = 0 ) - disable_ats_device(pdev); - - break; - - case DEV_TYPE_PCI: - if ( !iommu ) - return ERR_PTR(-ENODEV); - - if ( iommu_debug ) - printk(VTDPREFIX "%pd:PCI: unmap %pp\n", - domain, &PCI_SBDF(seg, bus, devfn)); - ret =3D domain_context_unmap_one(domain, iommu, bus, devfn); - if ( ret ) - break; - - tmp_bus =3D bus; - tmp_devfn =3D devfn; - if ( (ret =3D find_upstream_bridge(seg, &tmp_bus, &tmp_devfn, - &secbus)) < 1 ) - { - if ( ret ) - { - ret =3D -ENXIO; - if ( !domain->is_dying && - !is_hardware_domain(domain) && domain !=3D dom_io ) - { - domain_crash(domain); - /* Make upper layers continue in a best effort manner.= */ - ret =3D 0; - } - } - break; - } - - ret =3D domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn= ); - /* PCIe to PCI/PCIx bridge */ - if ( !ret && pdev_type(seg, tmp_bus, tmp_devfn) =3D=3D DEV_TYPE_PC= Ie2PCI_BRIDGE ) - ret =3D domain_context_unmap_one(domain, iommu, secbus, 0); - - break; - - default: - dprintk(XENLOG_ERR VTDPREFIX, "%pd:unknown(%u): %pp\n", - domain, pdev->type, &PCI_SBDF(seg, bus, devfn)); - return ERR_PTR(-EINVAL); - } - - if ( !ret && pdev->devfn =3D=3D devfn && - !QUARANTINE_SKIP(domain, pdev->arch.vtd.pgd_maddr) ) - check_cleanup_domid_map(domain, pdev, iommu); - - return drhd; -} - -static void cf_check iommu_clear_root_pgtable(struct domain *d) +static void cf_check iommu_clear_root_pgtable(struct domain *d, struct iom= mu_context *ctx) { - struct domain_iommu *hd =3D dom_iommu(d); - - spin_lock(&hd->arch.mapping_lock); - hd->arch.vtd.pgd_maddr =3D 0; - spin_unlock(&hd->arch.mapping_lock); + ctx->arch.vtd.pgd_maddr =3D 0; } =20 static void cf_check iommu_domain_teardown(struct domain *d) { - struct domain_iommu *hd =3D dom_iommu(d); + struct iommu_context *ctx =3D iommu_default_context(d); const struct acpi_drhd_unit *drhd; =20 if ( list_empty(&acpi_drhd_units) ) @@ -2053,37 +1670,15 @@ static void cf_check iommu_domain_teardown(struct d= omain *d) =20 iommu_identity_map_teardown(d); =20 - ASSERT(!hd->arch.vtd.pgd_maddr); + ASSERT(!ctx->arch.vtd.pgd_maddr); =20 for_each_drhd_unit ( drhd ) cleanup_domid_map(d->domain_id, drhd->iommu); - - XFREE(hd->arch.vtd.iommu_bitmap); -} - -static void quarantine_teardown(struct pci_dev *pdev, - const struct acpi_drhd_unit *drhd) -{ - struct domain_iommu *hd =3D dom_iommu(dom_io); - - ASSERT(pcidevs_locked()); - - if ( !pdev->arch.vtd.pgd_maddr ) - return; - - ASSERT(page_list_empty(&hd->arch.pgtables.list)); - page_list_move(&hd->arch.pgtables.list, &pdev->arch.pgtables_list); - while ( iommu_free_pgtables(dom_io) =3D=3D -ERESTART ) - /* nothing */; - pdev->arch.vtd.pgd_maddr =3D 0; - - if ( drhd ) - cleanup_domid_map(pdev->arch.pseudo_domid, drhd->iommu); } =20 static int __must_check cf_check intel_iommu_map_page( struct domain *d, dfn_t dfn, mfn_t mfn, unsigned int flags, - unsigned int *flush_flags) + unsigned int *flush_flags, struct iommu_context *ctx) { struct domain_iommu *hd =3D dom_iommu(d); struct dma_pte *page, *pte, old, new =3D {}; @@ -2095,32 +1690,28 @@ static int __must_check cf_check intel_iommu_map_pa= ge( PAGE_SIZE_4K); =20 /* Do 
nothing if VT-d shares EPT page table */ - if ( iommu_use_hap_pt(d) ) + if ( iommu_use_hap_pt(d) && !ctx->id ) return 0; =20 /* Do nothing if hardware domain and iommu supports pass thru. */ - if ( iommu_hwdom_passthrough && is_hardware_domain(d) ) + if ( iommu_hwdom_passthrough && is_hardware_domain(d) && !ctx->id ) return 0; =20 - spin_lock(&hd->arch.mapping_lock); - /* * IOMMU mapping request can be safely ignored when the domain is dyin= g. * - * hd->arch.mapping_lock guarantees that d->is_dying will be observed + * hd->lock guarantees that d->is_dying will be observed * before any page tables are freed (see iommu_free_pgtables()) */ if ( d->is_dying ) { - spin_unlock(&hd->arch.mapping_lock); return 0; } =20 - pg_maddr =3D addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), level, flush= _flags, + pg_maddr =3D addr_to_dma_page_maddr(d, ctx, dfn_to_daddr(dfn), level, = flush_flags, true); if ( pg_maddr < PAGE_SIZE ) { - spin_unlock(&hd->arch.mapping_lock); return -ENOMEM; } =20 @@ -2141,7 +1732,6 @@ static int __must_check cf_check intel_iommu_map_page( =20 if ( !((old.val ^ new.val) & ~DMA_PTE_CONTIG_MASK) ) { - spin_unlock(&hd->arch.mapping_lock); unmap_vtd_domain_page(page); return 0; } @@ -2170,7 +1760,7 @@ static int __must_check cf_check intel_iommu_map_page( new.val &=3D ~(LEVEL_MASK << level_to_offset_bits(level)); dma_set_pte_superpage(new); =20 - pg_maddr =3D addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), ++level, + pg_maddr =3D addr_to_dma_page_maddr(d, ctx, dfn_to_daddr(dfn), ++l= evel, flush_flags, false); BUG_ON(pg_maddr < PAGE_SIZE); =20 @@ -2180,11 +1770,10 @@ static int __must_check cf_check intel_iommu_map_pa= ge( iommu_sync_cache(pte, sizeof(*pte)); =20 *flush_flags |=3D IOMMU_FLUSHF_modified | IOMMU_FLUSHF_all; - iommu_queue_free_pgtable(hd, pg); + iommu_queue_free_pgtable(ctx, pg); perfc_incr(iommu_pt_coalesces); } =20 - spin_unlock(&hd->arch.mapping_lock); unmap_vtd_domain_page(page); =20 *flush_flags |=3D IOMMU_FLUSHF_added; @@ -2193,7 +1782,7 @@ static int __must_check cf_check intel_iommu_map_page( *flush_flags |=3D IOMMU_FLUSHF_modified; =20 if ( IOMMUF_order(flags) && !dma_pte_superpage(old) ) - queue_free_pt(hd, maddr_to_mfn(dma_pte_addr(old)), + queue_free_pt(ctx, maddr_to_mfn(dma_pte_addr(old)), IOMMUF_order(flags) / LEVEL_STRIDE); } =20 @@ -2201,7 +1790,8 @@ static int __must_check cf_check intel_iommu_map_page( } =20 static int __must_check cf_check intel_iommu_unmap_page( - struct domain *d, dfn_t dfn, unsigned int order, unsigned int *flush_f= lags) + struct domain *d, dfn_t dfn, unsigned int order, unsigned int *flush_f= lags, + struct iommu_context *ctx) { struct domain_iommu *hd =3D dom_iommu(d); daddr_t addr =3D dfn_to_daddr(dfn); @@ -2216,28 +1806,23 @@ static int __must_check cf_check intel_iommu_unmap_= page( ASSERT((hd->platform_ops->page_sizes >> order) & PAGE_SIZE_4K); =20 /* Do nothing if VT-d shares EPT page table */ - if ( iommu_use_hap_pt(d) ) + if ( iommu_use_hap_pt(d) && !ctx->id ) return 0; =20 /* Do nothing if hardware domain and iommu supports pass thru. */ if ( iommu_hwdom_passthrough && is_hardware_domain(d) ) return 0; =20 - spin_lock(&hd->arch.mapping_lock); /* get target level pte */ - pg_maddr =3D addr_to_dma_page_maddr(d, addr, level, flush_flags, false= ); + pg_maddr =3D addr_to_dma_page_maddr(d, ctx, addr, level, flush_flags, = false); if ( pg_maddr < PAGE_SIZE ) - { - spin_unlock(&hd->arch.mapping_lock); return pg_maddr ? 
-ENOMEM : 0; - } =20 page =3D map_vtd_domain_page(pg_maddr); pte =3D &page[address_level_offset(addr, level)]; =20 if ( !dma_pte_present(*pte) ) { - spin_unlock(&hd->arch.mapping_lock); unmap_vtd_domain_page(page); return 0; } @@ -2255,7 +1840,7 @@ static int __must_check cf_check intel_iommu_unmap_pa= ge( =20 unmap_vtd_domain_page(page); =20 - pg_maddr =3D addr_to_dma_page_maddr(d, addr, level, flush_flags, f= alse); + pg_maddr =3D addr_to_dma_page_maddr(d, ctx, addr, level, flush_fla= gs, false); BUG_ON(pg_maddr < PAGE_SIZE); =20 page =3D map_vtd_domain_page(pg_maddr); @@ -2264,42 +1849,38 @@ static int __must_check cf_check intel_iommu_unmap_= page( iommu_sync_cache(pte, sizeof(*pte)); =20 *flush_flags |=3D IOMMU_FLUSHF_all; - iommu_queue_free_pgtable(hd, pg); + iommu_queue_free_pgtable(ctx, pg); perfc_incr(iommu_pt_coalesces); } =20 - spin_unlock(&hd->arch.mapping_lock); - unmap_vtd_domain_page(page); =20 *flush_flags |=3D IOMMU_FLUSHF_modified; =20 if ( order && !dma_pte_superpage(old) ) - queue_free_pt(hd, maddr_to_mfn(dma_pte_addr(old)), + queue_free_pt(ctx, maddr_to_mfn(dma_pte_addr(old)), order / LEVEL_STRIDE); =20 return 0; } =20 static int cf_check intel_iommu_lookup_page( - struct domain *d, dfn_t dfn, mfn_t *mfn, unsigned int *flags) + struct domain *d, dfn_t dfn, mfn_t *mfn, unsigned int *flags, + struct iommu_context *ctx) { - struct domain_iommu *hd =3D dom_iommu(d); uint64_t val; =20 /* * If VT-d shares EPT page table or if the domain is the hardware * domain and iommu_passthrough is set then pass back the dfn. */ - if ( iommu_use_hap_pt(d) || - (iommu_hwdom_passthrough && is_hardware_domain(d)) ) + if ( (iommu_use_hap_pt(d) || + (iommu_hwdom_passthrough && is_hardware_domain(d))) + && !ctx->id ) return -EOPNOTSUPP; =20 - spin_lock(&hd->arch.mapping_lock); - - val =3D addr_to_dma_page_maddr(d, dfn_to_daddr(dfn), 0, NULL, false); =20 - spin_unlock(&hd->arch.mapping_lock); + val =3D addr_to_dma_page_maddr(d, ctx, dfn_to_daddr(dfn), 0, NULL, fal= se); =20 if ( val < PAGE_SIZE ) return -ENOENT; @@ -2320,7 +1901,7 @@ static bool __init vtd_ept_page_compatible(const stru= ct vtd_iommu *iommu) =20 /* EPT is not initialised yet, so we must check the capability in * the MSR explicitly rather than use cpu_has_vmx_ept_*() */ - if ( rdmsr_safe(MSR_IA32_VMX_EPT_VPID_CAP, ept_cap) !=3D 0 )=20 + if ( rdmsr_safe(MSR_IA32_VMX_EPT_VPID_CAP, ept_cap) !=3D 0 ) return false; =20 return (ept_has_2mb(ept_cap) && opt_hap_2mb) <=3D @@ -2329,44 +1910,6 @@ static bool __init vtd_ept_page_compatible(const str= uct vtd_iommu *iommu) (cap_sps_1gb(vtd_cap) && iommu_superpages); } =20 -static int cf_check intel_iommu_add_device(u8 devfn, struct pci_dev *pdev) -{ - struct acpi_rmrr_unit *rmrr; - u16 bdf; - int ret, i; - - ASSERT(pcidevs_locked()); - - if ( !pdev->domain ) - return -EINVAL; - - for_each_rmrr_device ( rmrr, bdf, i ) - { - if ( rmrr->segment =3D=3D pdev->seg && bdf =3D=3D PCI_BDF(pdev->bu= s, devfn) ) - { - /* - * iommu_add_device() is only called for the hardware - * domain (see xen/drivers/passthrough/pci.c:pci_add_device()). - * Since RMRRs are always reserved in the e820 map for the har= dware - * domain, there shouldn't be a conflict. 
- */ - ret =3D iommu_identity_mapping(pdev->domain, p2m_access_rw, - rmrr->base_address, rmrr->end_add= ress, - 0); - if ( ret ) - dprintk(XENLOG_ERR VTDPREFIX, "%pd: RMRR mapping failed\n", - pdev->domain); - } - } - - ret =3D domain_context_mapping(pdev->domain, devfn, pdev); - if ( ret ) - dprintk(XENLOG_ERR VTDPREFIX, "%pd: context mapping failed\n", - pdev->domain); - - return ret; -} - static int cf_check intel_iommu_enable_device(struct pci_dev *pdev) { struct acpi_drhd_unit *drhd =3D acpi_find_matched_drhd_unit(pdev); @@ -2382,49 +1925,16 @@ static int cf_check intel_iommu_enable_device(struc= t pci_dev *pdev) return ret >=3D 0 ? 0 : ret; } =20 -static int cf_check intel_iommu_remove_device(u8 devfn, struct pci_dev *pd= ev) -{ - const struct acpi_drhd_unit *drhd; - struct acpi_rmrr_unit *rmrr; - u16 bdf; - unsigned int i; - - if ( !pdev->domain ) - return -EINVAL; - - drhd =3D domain_context_unmap(pdev->domain, devfn, pdev); - if ( IS_ERR(drhd) ) - return PTR_ERR(drhd); - - for_each_rmrr_device ( rmrr, bdf, i ) - { - if ( rmrr->segment !=3D pdev->seg || bdf !=3D PCI_BDF(pdev->bus, d= evfn) ) - continue; - - /* - * Any flag is nothing to clear these mappings but here - * its always safe and strict to set 0. - */ - iommu_identity_mapping(pdev->domain, p2m_access_x, rmrr->base_addr= ess, - rmrr->end_address, 0); - } - - quarantine_teardown(pdev, drhd); - - if ( drhd ) - { - iommu_free_domid(pdev->arch.pseudo_domid, - drhd->iommu->pseudo_domid_map); - pdev->arch.pseudo_domid =3D DOMID_INVALID; - } - - return 0; -} - static int __hwdom_init cf_check setup_hwdom_device( u8 devfn, struct pci_dev *pdev) { - return domain_context_mapping(pdev->domain, devfn, pdev); + if (pdev->type =3D=3D DEV_TYPE_PCI_HOST_BRIDGE || + pdev->type =3D=3D DEV_TYPE_PCIe_BRIDGE || + pdev->type =3D=3D DEV_TYPE_PCIe2PCI_BRIDGE || + pdev->type =3D=3D DEV_TYPE_LEGACY_PCI_BRIDGE) + return 0; + + return _iommu_attach_context(hardware_domain, pdev, 0); } =20 void clear_fault_bits(struct vtd_iommu *iommu) @@ -2518,7 +2028,7 @@ static int __must_check init_vtd_hw(bool resume) =20 /* * Enable queue invalidation - */ =20 + */ for_each_drhd_unit ( drhd ) { iommu =3D drhd->iommu; @@ -2539,7 +2049,7 @@ static int __must_check init_vtd_hw(bool resume) =20 /* * Enable interrupt remapping - */ =20 + */ if ( iommu_intremap !=3D iommu_intremap_off ) { int apic; @@ -2622,15 +2132,60 @@ static struct iommu_state { uint32_t fectl; } *__read_mostly iommu_state; =20 -static int __init cf_check vtd_setup(void) +static void arch_iommu_dump_domain_contexts(struct domain *d) { - struct acpi_drhd_unit *drhd; - struct vtd_iommu *iommu; - unsigned int large_sizes =3D iommu_superpages ? 
PAGE_SIZE_2M | PAGE_SI= ZE_1G : 0; - int ret; - bool reg_inval_supported =3D true; + unsigned int i, iommu_no; + struct pci_dev *pdev; + struct iommu_context *ctx; + struct domain_iommu *hd =3D dom_iommu(d); =20 - if ( list_empty(&acpi_drhd_units) ) + printk("d%hu contexts\n", d->domain_id); + + spin_lock(&hd->lock); + + for (i =3D 0; i < (1 + dom_iommu(d)->other_contexts.count); ++i) + { + if (iommu_check_context(d, i)) + { + ctx =3D iommu_get_context(d, i); + printk(" Context %d (%"PRIx64")\n", i, ctx->arch.vtd.pgd_maddr= ); + + for (iommu_no =3D 0; iommu_no < nr_iommus; iommu_no++) + printk(" IOMMU %hu (used=3D%u; did=3D%hu)\n", iommu_no, + test_bit(iommu_no, ctx->arch.vtd.iommu_bitmap), + ctx->arch.vtd.didmap[iommu_no]); + + list_for_each_entry(pdev, &ctx->devices, context_list) + { + printk(" - %pp\n", &pdev->sbdf); + } + } + } + + spin_unlock(&hd->lock); +} + +static void arch_iommu_dump_contexts(unsigned char key) +{ + struct domain *d; + + for_each_domain(d) { + struct domain_iommu *hd =3D dom_iommu(d); + printk("d%hu arena page usage: %d\n", d->domain_id, + atomic_read(&hd->arch.pt_arena.used_pages)); + + arch_iommu_dump_domain_contexts(d); + } +} +static int __init cf_check vtd_setup(void) +{ + struct acpi_drhd_unit *drhd; + struct vtd_iommu *iommu; + unsigned int large_sizes =3D iommu_superpages ? PAGE_SIZE_2M | PAGE_SI= ZE_1G : 0; + int ret; + bool reg_inval_supported =3D true; + + if ( list_empty(&acpi_drhd_units) ) { ret =3D -ENODEV; goto error; @@ -2749,6 +2304,7 @@ static int __init cf_check vtd_setup(void) iommu_ops.page_sizes |=3D large_sizes; =20 register_keyhandler('V', vtd_dump_iommu_info, "dump iommu info", 1); + register_keyhandler('X', arch_iommu_dump_contexts, "dump iommu context= s", 1); =20 return 0; =20 @@ -2763,192 +2319,6 @@ static int __init cf_check vtd_setup(void) return ret; } =20 -static int cf_check reassign_device_ownership( - struct domain *source, - struct domain *target, - u8 devfn, struct pci_dev *pdev) -{ - int ret; - - if ( !QUARANTINE_SKIP(target, pdev->arch.vtd.pgd_maddr) ) - { - if ( !has_arch_pdevs(target) ) - vmx_pi_hooks_assign(target); - -#ifdef CONFIG_PV - /* - * Devices assigned to untrusted domains (here assumed to be any d= omU) - * can attempt to send arbitrary LAPIC/MSI messages. We are unprot= ected - * by the root complex unless interrupt remapping is enabled. - */ - if ( !iommu_intremap && !is_hardware_domain(target) && - !is_system_domain(target) ) - untrusted_msi =3D true; -#endif - - ret =3D domain_context_mapping(target, devfn, pdev); - - if ( !ret && pdev->devfn =3D=3D devfn && - !QUARANTINE_SKIP(source, pdev->arch.vtd.pgd_maddr) ) - { - const struct acpi_drhd_unit *drhd =3D acpi_find_matched_drhd_u= nit(pdev); - - if ( drhd ) - check_cleanup_domid_map(source, pdev, drhd->iommu); - } - } - else - { - const struct acpi_drhd_unit *drhd; - - drhd =3D domain_context_unmap(source, devfn, pdev); - ret =3D IS_ERR(drhd) ? 
PTR_ERR(drhd) : 0; - } - if ( ret ) - { - if ( !has_arch_pdevs(target) ) - vmx_pi_hooks_deassign(target); - return ret; - } - - if ( devfn =3D=3D pdev->devfn && pdev->domain !=3D target ) - { - write_lock(&source->pci_lock); - list_del(&pdev->domain_list); - write_unlock(&source->pci_lock); - - pdev->domain =3D target; - - write_lock(&target->pci_lock); - list_add(&pdev->domain_list, &target->pdev_list); - write_unlock(&target->pci_lock); - } - - if ( !has_arch_pdevs(source) ) - vmx_pi_hooks_deassign(source); - - /* - * If the device belongs to the hardware domain, and it has RMRR, don't - * remove it from the hardware domain, because BIOS may use RMRR at - * booting time. - */ - if ( !is_hardware_domain(source) ) - { - const struct acpi_rmrr_unit *rmrr; - u16 bdf; - unsigned int i; - - for_each_rmrr_device( rmrr, bdf, i ) - if ( rmrr->segment =3D=3D pdev->seg && - bdf =3D=3D PCI_BDF(pdev->bus, devfn) ) - { - /* - * Any RMRR flag is always ignored when remove a device, - * but its always safe and strict to set 0. - */ - ret =3D iommu_identity_mapping(source, p2m_access_x, - rmrr->base_address, - rmrr->end_address, 0); - if ( ret && ret !=3D -ENOENT ) - return ret; - } - } - - return 0; -} - -static int cf_check intel_iommu_assign_device( - struct domain *d, u8 devfn, struct pci_dev *pdev, u32 flag) -{ - struct domain *s =3D pdev->domain; - struct acpi_rmrr_unit *rmrr; - int ret =3D 0, i; - u16 bdf, seg; - u8 bus; - - if ( list_empty(&acpi_drhd_units) ) - return -ENODEV; - - seg =3D pdev->seg; - bus =3D pdev->bus; - /* - * In rare cases one given rmrr is shared by multiple devices but - * obviously this would put the security of a system at risk. So - * we would prevent from this sort of device assignment. But this - * can be permitted if user set - * "pci =3D [ 'sbdf, rdm_policy=3Drelaxed' ]" - * - * TODO: in the future we can introduce group device assignment - * interface to make sure devices sharing RMRR are assigned to the - * same domain together. - */ - for_each_rmrr_device( rmrr, bdf, i ) - { - if ( rmrr->segment =3D=3D seg && bdf =3D=3D PCI_BDF(bus, devfn) && - rmrr->scope.devices_cnt > 1 ) - { - bool relaxed =3D flag & XEN_DOMCTL_DEV_RDM_RELAXED; - - printk(XENLOG_GUEST "%s" VTDPREFIX - " It's %s to assign %pp" - " with shared RMRR at %"PRIx64" for %pd.\n", - relaxed ? XENLOG_WARNING : XENLOG_ERR, - relaxed ? "risky" : "disallowed", - &PCI_SBDF(seg, bus, devfn), rmrr->base_address, d); - if ( !relaxed ) - return -EPERM; - } - } - - if ( d =3D=3D dom_io ) - return reassign_device_ownership(s, d, devfn, pdev); - - /* Setup rmrr identity mapping */ - for_each_rmrr_device( rmrr, bdf, i ) - { - if ( rmrr->segment =3D=3D seg && bdf =3D=3D PCI_BDF(bus, devfn) ) - { - ret =3D iommu_identity_mapping(d, p2m_access_rw, rmrr->base_ad= dress, - rmrr->end_address, flag); - if ( ret ) - { - printk(XENLOG_G_ERR VTDPREFIX - "%pd: cannot map reserved region [%"PRIx64",%"PRIx6= 4"]: %d\n", - d, rmrr->base_address, rmrr->end_address, ret); - break; - } - } - } - - if ( !ret ) - ret =3D reassign_device_ownership(s, d, devfn, pdev); - - /* See reassign_device_ownership() for the hwdom aspect. 
*/ - if ( !ret || is_hardware_domain(d) ) - return ret; - - for_each_rmrr_device( rmrr, bdf, i ) - { - if ( rmrr->segment =3D=3D seg && bdf =3D=3D PCI_BDF(bus, devfn) ) - { - int rc =3D iommu_identity_mapping(d, p2m_access_x, - rmrr->base_address, - rmrr->end_address, 0); - - if ( rc && rc !=3D -ENOENT ) - { - printk(XENLOG_ERR VTDPREFIX - "%pd: cannot unmap reserved region [%"PRIx64",%"PRI= x64"]: %d\n", - d, rmrr->base_address, rmrr->end_address, rc); - domain_crash(d); - break; - } - } - } - - return ret; -} - static int cf_check intel_iommu_group_id(u16 seg, u8 bus, u8 devfn) { u8 secbus; @@ -3073,6 +2443,11 @@ static void vtd_dump_page_table_level(paddr_t pt_mad= dr, int level, paddr_t gpa, if ( level < 1 ) return; =20 + if (pt_maddr =3D=3D 0) { + printk(" (empty)\n"); + return; + } + pt_vaddr =3D map_vtd_domain_page(pt_maddr); =20 next_level =3D level - 1; @@ -3103,158 +2478,478 @@ static void vtd_dump_page_table_level(paddr_t pt_= maddr, int level, paddr_t gpa, =20 static void cf_check vtd_dump_page_tables(struct domain *d) { - const struct domain_iommu *hd =3D dom_iommu(d); + struct domain_iommu *hd =3D dom_iommu(d); + unsigned int i; =20 printk(VTDPREFIX" %pd table has %d levels\n", d, agaw_to_level(hd->arch.vtd.agaw)); - vtd_dump_page_table_level(hd->arch.vtd.pgd_maddr, - agaw_to_level(hd->arch.vtd.agaw), 0, 0); + + for (i =3D 1; i < (1 + hd->other_contexts.count); ++i) + { + bool allocated =3D iommu_check_context(d, i); + printk(VTDPREFIX " %pd context %d: %s\n", d, i, + allocated ? "allocated" : "non-allocated"); + + if (allocated) { + const struct iommu_context *ctx =3D iommu_get_context(d, i); + vtd_dump_page_table_level(ctx->arch.vtd.pgd_maddr, + agaw_to_level(hd->arch.vtd.agaw), 0,= 0); + } + } } =20 -static int fill_qpt(struct dma_pte *this, unsigned int level, - struct page_info *pgs[6]) +static int intel_iommu_context_init(struct domain *d, struct iommu_context= *ctx, u32 flags) { - struct domain_iommu *hd =3D dom_iommu(dom_io); - unsigned int i; - int rc =3D 0; + struct acpi_drhd_unit *drhd; =20 - for ( i =3D 0; !rc && i < PTE_NUM; ++i ) + ctx->arch.vtd.didmap =3D xzalloc_array(u16, nr_iommus); + + if ( !ctx->arch.vtd.didmap ) + return -ENOMEM; + + ctx->arch.vtd.iommu_bitmap =3D xzalloc_array(unsigned long, + BITS_TO_LONGS(nr_iommus)); + if ( !ctx->arch.vtd.iommu_bitmap ) + return -ENOMEM; + + ctx->arch.vtd.duplicated_rmrr =3D false; + ctx->arch.vtd.superpage_progress =3D 0; + + if ( flags & IOMMU_CONTEXT_INIT_default ) { - struct dma_pte *pte =3D &this[i], *next; + ctx->arch.vtd.pgd_maddr =3D 0; =20 - if ( !dma_pte_present(*pte) ) + /* Populate context DID map using domain id. */ + for_each_drhd_unit(drhd) { - if ( !pgs[level] ) - { - /* - * The pgtable allocator is fine for the leaf page, as wel= l as - * page table pages, and the resulting allocations are alw= ays - * zeroed. 
- */ - pgs[level] =3D iommu_alloc_pgtable(hd, 0); - if ( !pgs[level] ) - { - rc =3D -ENOMEM; - break; - } - - if ( level ) - { - next =3D map_vtd_domain_page(page_to_maddr(pgs[level])= ); - rc =3D fill_qpt(next, level - 1, pgs); - unmap_vtd_domain_page(next); - } - } + ctx->arch.vtd.didmap[drhd->iommu->index] =3D + convert_domid(drhd->iommu, d->domain_id); + } + } + else + { + /* Populate context DID map using pseudo DIDs */ + for_each_drhd_unit(drhd) + { + ctx->arch.vtd.didmap[drhd->iommu->index] =3D + iommu_alloc_domid(drhd->iommu->pseudo_domid_map); + } + + /* Create initial context page */ + addr_to_dma_page_maddr(d, ctx, 0, min_pt_levels, NULL, true); + } =20 - dma_set_pte_addr(*pte, page_to_maddr(pgs[level])); - dma_set_pte_readable(*pte); - dma_set_pte_writable(*pte); + return arch_iommu_context_init(d, ctx, flags); +} + +static int intel_iommu_cleanup_pte(uint64_t pte_maddr, bool preempt) +{ + size_t i; + struct dma_pte *pte =3D map_vtd_domain_page(pte_maddr); + + for (i =3D 0; i < (1 << PAGETABLE_ORDER); ++i) + if ( dma_pte_present(pte[i]) ) + { + /* Remove the reference of the target mapping */ + put_page(maddr_to_page(dma_pte_addr(pte[i]))); + + if ( preempt ) + dma_clear_pte(pte[i]); } - else if ( level && !dma_pte_superpage(*pte) ) + + unmap_vtd_domain_page(pte); + + return 0; +} + +/** + * Cleanup logic : + * Walk through the entire page table, progressively removing mappings if = preempt. + * + * Return values : + * - Report preemption with -ERESTART. + * - Report empty pte/pgd with 0. + * + * When preempted during superpage operation, store state in vtd.superpage= _progress. + */ + +static int intel_iommu_cleanup_superpage(struct iommu_context *ctx, + unsigned int page_order, uint64_= t pte_maddr, + bool preempt) +{ + size_t i =3D 0, page_count =3D 1 << page_order; + struct page_info *page =3D maddr_to_page(pte_maddr); + + if ( preempt ) + i =3D ctx->arch.vtd.superpage_progress; + + for (; i < page_count; page++) + { + put_page(page); + + if ( preempt && (i & 0xff) && general_preempt_check() ) { - next =3D map_vtd_domain_page(dma_pte_addr(*pte)); - rc =3D fill_qpt(next, level - 1, pgs); - unmap_vtd_domain_page(next); + ctx->arch.vtd.superpage_progress =3D i + 1; + return -ERESTART; } } =20 - return rc; + if ( preempt ) + ctx->arch.vtd.superpage_progress =3D 0; + + return 0; } =20 -static int cf_check intel_iommu_quarantine_init(struct pci_dev *pdev, - bool scratch_page) +static int intel_iommu_cleanup_mappings(struct iommu_context *ctx, + unsigned int nr_pt_levels, uint64= _t pgd_maddr, + bool preempt) { - struct domain_iommu *hd =3D dom_iommu(dom_io); - struct page_info *pg; - unsigned int agaw =3D hd->arch.vtd.agaw; - unsigned int level =3D agaw_to_level(agaw); - const struct acpi_drhd_unit *drhd; - const struct acpi_rmrr_unit *rmrr; - unsigned int i, bdf; - bool rmrr_found =3D false; + size_t i; int rc; + struct dma_pte *pgd =3D map_vtd_domain_page(pgd_maddr); =20 - ASSERT(pcidevs_locked()); - ASSERT(!hd->arch.vtd.pgd_maddr); - ASSERT(page_list_empty(&hd->arch.pgtables.list)); + for (i =3D 0; i < (1 << PAGETABLE_ORDER); ++i) + { + if ( dma_pte_present(pgd[i]) ) + { + uint64_t pte_maddr =3D dma_pte_addr(pgd[i]); + + if ( dma_pte_superpage(pgd[i]) ) + rc =3D intel_iommu_cleanup_superpage(ctx, nr_pt_levels * S= UPERPAGE_ORDER, + pte_maddr, preempt); + else if ( nr_pt_levels > 2 ) + /* Next level is not PTE */ + rc =3D intel_iommu_cleanup_mappings(ctx, nr_pt_levels - 1, + pte_maddr, preempt); + else + rc =3D intel_iommu_cleanup_pte(pte_maddr, preempt); + + if ( preempt && 
!rc ) + /* Fold pgd (no more mappings in it) */ + dma_clear_pte(pgd[i]); + else if ( preempt && (rc =3D=3D -ERESTART || general_preempt_c= heck()) ) + { + unmap_vtd_domain_page(pgd); + return -ERESTART; + } + } + } =20 - if ( pdev->arch.vtd.pgd_maddr ) + unmap_vtd_domain_page(pgd); + + return 0; +} + +static int intel_iommu_context_teardown(struct domain *d, struct iommu_con= text *ctx, u32 flags) +{ + struct acpi_drhd_unit *drhd; + pcidevs_lock(); + + // Cleanup mappings + if ( intel_iommu_cleanup_mappings(ctx, agaw_to_level(d->iommu.arch.vtd= .agaw), + ctx->arch.vtd.pgd_maddr, + flags & IOMMUF_preempt) < 0 ) { - clear_domain_page(pdev->arch.leaf_mfn); - return 0; + pcidevs_unlock(); + return -ERESTART; } =20 - drhd =3D acpi_find_matched_drhd_unit(pdev); - if ( !drhd ) - return -ENODEV; + if (ctx->arch.vtd.didmap) + { + for_each_drhd_unit(drhd) + { + iommu_free_domid(ctx->arch.vtd.didmap[drhd->iommu->index], + drhd->iommu->pseudo_domid_map); + } =20 - pg =3D iommu_alloc_pgtable(hd, 0); - if ( !pg ) - return -ENOMEM; + xfree(ctx->arch.vtd.didmap); + } =20 - rc =3D context_set_domain_id(NULL, pdev->arch.pseudo_domid, drhd->iomm= u); + pcidevs_unlock(); + return arch_iommu_context_teardown(d, ctx, flags); +} =20 - /* Transiently install the root into DomIO, for iommu_identity_mapping= (). */ - hd->arch.vtd.pgd_maddr =3D page_to_maddr(pg); +static int intel_iommu_map_identity(struct domain *d, struct pci_dev *pdev, + struct iommu_context *ctx, struct acpi= _rmrr_unit *rmrr) +{ + /* TODO: This code doesn't cleanup on failure */ =20 - for_each_rmrr_device ( rmrr, bdf, i ) + int ret =3D 0, rc =3D 0; + unsigned int flush_flags =3D 0, flags; + u64 base_pfn =3D rmrr->base_address >> PAGE_SHIFT_4K; + u64 end_pfn =3D PAGE_ALIGN_4K(rmrr->end_address) >> PAGE_SHIFT_4K; + u64 pfn =3D base_pfn; + + printk(XENLOG_INFO VTDPREFIX + " Mapping d%dc%d device %pp identity mapping [%08" PRIx64 ":%0= 8" PRIx64 "]\n", + d->domain_id, ctx->id, &pdev->sbdf, rmrr->base_address, rmrr->= end_address); + + ASSERT(end_pfn >=3D base_pfn); + + while (pfn < end_pfn) { - if ( rc ) - break; + mfn_t mfn =3D INVALID_MFN; + ret =3D intel_iommu_lookup_page(d, _dfn(pfn), &mfn, &flags, ctx); =20 - if ( rmrr->segment =3D=3D pdev->seg && bdf =3D=3D pdev->sbdf.bdf ) + if ( ret =3D=3D -ENOENT ) { - rmrr_found =3D true; + ret =3D intel_iommu_map_page(d, _dfn(pfn), _mfn(pfn), + IOMMUF_readable | IOMMUF_writable, + &flush_flags, ctx); =20 - rc =3D iommu_identity_mapping(dom_io, p2m_access_rw, - rmrr->base_address, rmrr->end_addr= ess, - 0); - if ( rc ) + if ( ret < 0 ) + { printk(XENLOG_ERR VTDPREFIX - "%pp: RMRR quarantine mapping failed\n", - &pdev->sbdf); + " Unable to map RMRR page %"PRI_pfn" (%d)\n", + pfn, ret); + break; + } + } + else if ( ret ) + { + printk(XENLOG_ERR VTDPREFIX + " Unable to lookup page %"PRI_pfn" (%d)\n", + pfn, ret); + break; } + else if ( mfn_x(mfn) !=3D pfn ) + { + /* The dfn is already mapped to something else, can't continue= . */ + printk(XENLOG_ERR VTDPREFIX + " Unable to map RMRR page %"PRI_mfn"!=3D%"PRI_pfn" (inc= ompatible mapping)\n", + mfn_x(mfn), pfn); + + ret =3D -EINVAL; + break; + } + else if ( mfn_x(mfn) =3D=3D pfn ) + { + /* + * There is already a identity mapping in this context, we nee= d to + * be extra-careful when dettaching the device to not break an= other + * existing RMRR. 
+ */ + printk(XENLOG_WARNING VTDPREFIX + "Duplicated RMRR mapping %"PRI_pfn"\n", pfn); + + ctx->arch.vtd.duplicated_rmrr =3D true; + } + + pfn++; } =20 - iommu_identity_map_teardown(dom_io); - hd->arch.vtd.pgd_maddr =3D 0; - pdev->arch.vtd.pgd_maddr =3D page_to_maddr(pg); + rc =3D iommu_flush_iotlb(d, ctx, _dfn(base_pfn), end_pfn - base_pfn + = 1, flush_flags); =20 - if ( !rc && scratch_page ) + return ret ?: rc; +} + +static int intel_iommu_map_dev_rmrr(struct domain *d, struct pci_dev *pdev, + struct iommu_context *ctx) +{ + struct acpi_rmrr_unit *rmrr; + u16 bdf; + int ret, i; + + for_each_rmrr_device(rmrr, bdf, i) { - struct dma_pte *root; - struct page_info *pgs[6] =3D {}; + if ( PCI_SBDF(rmrr->segment, bdf).sbdf =3D=3D pdev->sbdf.sbdf ) + { + ret =3D intel_iommu_map_identity(d, pdev, ctx, rmrr); =20 - root =3D map_vtd_domain_page(pdev->arch.vtd.pgd_maddr); - rc =3D fill_qpt(root, level - 1, pgs); - unmap_vtd_domain_page(root); + if ( ret < 0 ) + return ret; + } + } =20 - pdev->arch.leaf_mfn =3D page_to_mfn(pgs[0]); + return 0; +} + +static int intel_iommu_unmap_identity(struct domain *d, struct pci_dev *pd= ev, + struct iommu_context *ctx, struct ac= pi_rmrr_unit *rmrr) +{ + /* TODO: This code doesn't cleanup on failure */ + + int ret =3D 0, rc =3D 0; + unsigned int flush_flags =3D 0; + u64 base_pfn =3D rmrr->base_address >> PAGE_SHIFT_4K; + u64 end_pfn =3D PAGE_ALIGN_4K(rmrr->end_address) >> PAGE_SHIFT_4K; + u64 pfn =3D base_pfn; + + printk(VTDPREFIX + " Unmapping d%dc%d device %pp identity mapping [%08" PRIx64 ":= %08" PRIx64 "]\n", + d->domain_id, ctx->id, &pdev->sbdf, rmrr->base_address, rmrr->= end_address); + + ASSERT(end_pfn >=3D base_pfn); + + while (pfn < end_pfn) + { + ret =3D intel_iommu_unmap_page(d, _dfn(pfn), PAGE_ORDER_4K, &flush= _flags, ctx); + + if ( ret ) + break; + + pfn++; } =20 - page_list_move(&pdev->arch.pgtables_list, &hd->arch.pgtables.list); + rc =3D iommu_flush_iotlb(d, ctx, _dfn(base_pfn), end_pfn - base_pfn + = 1, flush_flags); =20 - if ( rc || (!scratch_page && !rmrr_found) ) - quarantine_teardown(pdev, drhd); + return ret ?: rc; +} =20 - return rc; +/* Check if another overlapping rmrr exist for another device of the conte= xt */ +static bool intel_iommu_check_duplicate(struct domain *d, struct pci_dev *= pdev, + struct iommu_context *ctx, + struct acpi_rmrr_unit *rmrr) +{ + struct acpi_rmrr_unit *other_rmrr; + u16 bdf; + int i; + + for_each_rmrr_device(other_rmrr, bdf, i) + { + if (rmrr =3D=3D other_rmrr) + continue; + + /* Skip RMRR entries of the same device */ + if ( PCI_SBDF(rmrr->segment, bdf).sbdf =3D=3D pdev->sbdf.sbdf ) + continue; + + /* Check for overlap */ + if ( rmrr->base_address >=3D other_rmrr->base_address + && rmrr->end_address <=3D other_rmrr->end_address ) + return true; + + if ( other_rmrr->base_address >=3D rmrr->base_address + && other_rmrr->end_address <=3D rmrr->end_address ) + return true; + } + + return false; +} + +static int intel_iommu_unmap_dev_rmrr(struct domain *d, struct pci_dev *pd= ev, + struct iommu_context *ctx) +{ + struct acpi_rmrr_unit *rmrr; + u16 bdf; + int ret, i; + + for_each_rmrr_device(rmrr, bdf, i) + { + if ( PCI_SBDF(rmrr->segment, bdf).sbdf =3D=3D pdev->sbdf.sbdf ) + { + if ( ctx->arch.vtd.duplicated_rmrr + && intel_iommu_check_duplicate(d, pdev, ctx, rmrr) ) + continue; + + ret =3D intel_iommu_unmap_identity(d, pdev, ctx, rmrr); + + if ( ret < 0 ) + return ret; + } + } + + return 0; +} + +static int intel_iommu_attach(struct domain *d, struct pci_dev *pdev, + struct iommu_context *ctx) +{ + int ret; + 
const struct acpi_drhd_unit *drhd =3D acpi_find_matched_drhd_unit(pdev= ); + + if (!pdev || !drhd) + return -EINVAL; + + if ( ctx->id ) + { + ret =3D intel_iommu_map_dev_rmrr(d, pdev, ctx); + + if ( ret ) + return ret; + } + + ret =3D apply_context(d, ctx, pdev, pdev->devfn); + + if ( ret ) + return ret; + + pci_vtd_quirk(pdev); + + return ret; +} + +static int intel_iommu_dettach(struct domain *d, struct pci_dev *pdev, + struct iommu_context *prev_ctx) +{ + int ret; + const struct acpi_drhd_unit *drhd =3D acpi_find_matched_drhd_unit(pdev= ); + + if (!pdev || !drhd) + return -EINVAL; + + ret =3D unapply_context_single(d, prev_ctx, drhd->iommu, pdev->bus, pd= ev->devfn); + + if ( ret ) + return ret; + + if ( prev_ctx->id ) + WARN_ON(intel_iommu_unmap_dev_rmrr(d, pdev, prev_ctx)); + + check_cleanup_domid_map(d, prev_ctx, NULL, drhd->iommu); + + return ret; +} + +static int intel_iommu_reattach(struct domain *d, struct pci_dev *pdev, + struct iommu_context *prev_ctx, + struct iommu_context *ctx) +{ + int ret; + const struct acpi_drhd_unit *drhd =3D acpi_find_matched_drhd_unit(pdev= ); + + if (!pdev || !drhd) + return -EINVAL; + + if ( ctx->id ) + { + ret =3D intel_iommu_map_dev_rmrr(d, pdev, ctx); + + if ( ret ) + return ret; + } + + ret =3D apply_context_single(d, ctx, drhd->iommu, pdev->bus, pdev->dev= fn); + + if ( ret ) + return ret; + + if ( prev_ctx->id ) + WARN_ON(intel_iommu_unmap_dev_rmrr(d, pdev, prev_ctx)); + + /* We are overwriting an entry, cleanup previous domid if needed. */ + check_cleanup_domid_map(d, prev_ctx, pdev, drhd->iommu); + + pci_vtd_quirk(pdev); + + return ret; } =20 static const struct iommu_ops __initconst_cf_clobber vtd_ops =3D { .page_sizes =3D PAGE_SIZE_4K, .init =3D intel_iommu_domain_init, .hwdom_init =3D intel_iommu_hwdom_init, - .quarantine_init =3D intel_iommu_quarantine_init, - .add_device =3D intel_iommu_add_device, + .context_init =3D intel_iommu_context_init, + .context_teardown =3D intel_iommu_context_teardown, + .attach =3D intel_iommu_attach, + .dettach =3D intel_iommu_dettach, + .reattach =3D intel_iommu_reattach, .enable_device =3D intel_iommu_enable_device, - .remove_device =3D intel_iommu_remove_device, - .assign_device =3D intel_iommu_assign_device, .teardown =3D iommu_domain_teardown, .clear_root_pgtable =3D iommu_clear_root_pgtable, .map_page =3D intel_iommu_map_page, .unmap_page =3D intel_iommu_unmap_page, .lookup_page =3D intel_iommu_lookup_page, - .reassign_device =3D reassign_device_ownership, .get_device_group_id =3D intel_iommu_group_id, .enable_x2apic =3D intel_iommu_enable_eim, .disable_x2apic =3D intel_iommu_disable_eim, diff --git a/xen/drivers/passthrough/vtd/quirks.c b/xen/drivers/passthrough= /vtd/quirks.c index 950dcd56ef..719985f885 100644 --- a/xen/drivers/passthrough/vtd/quirks.c +++ b/xen/drivers/passthrough/vtd/quirks.c @@ -408,9 +408,8 @@ void __init platform_quirks_init(void) =20 static int __must_check map_me_phantom_function(struct domain *domain, unsigned int dev, - domid_t domid, - paddr_t pgd_maddr, - unsigned int mode) + unsigned int mode, + struct iommu_context *ctx) { struct acpi_drhd_unit *drhd; struct pci_dev *pdev; @@ -422,18 +421,18 @@ static int __must_check map_me_phantom_function(struc= t domain *domain, =20 /* map or unmap ME phantom function */ if ( !(mode & UNMAP_ME_PHANTOM_FUNC) ) - rc =3D domain_context_mapping_one(domain, drhd->iommu, 0, - PCI_DEVFN(dev, 7), NULL, - domid, pgd_maddr, mode); + rc =3D apply_context_single(domain, ctx, drhd->iommu, 0, + PCI_DEVFN(dev, 7)); else - rc =3D 
domain_context_unmap_one(domain, drhd->iommu, 0, - PCI_DEVFN(dev, 7)); + rc =3D unapply_context_single(domain, ctx, drhd->iommu, 0, + PCI_DEVFN(dev, 7)); =20 return rc; } =20 int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn, - domid_t domid, paddr_t pgd_maddr, unsigned int mode) + domid_t domid, unsigned int mode, + struct iommu_context *ctx) { u32 id; int rc =3D 0; @@ -457,7 +456,7 @@ int me_wifi_quirk(struct domain *domain, uint8_t bus, u= int8_t devfn, case 0x423b8086: case 0x423c8086: case 0x423d8086: - rc =3D map_me_phantom_function(domain, 3, domid, pgd_maddr= , mode); + rc =3D map_me_phantom_function(domain, 3, mode, ctx); break; default: break; @@ -483,7 +482,7 @@ int me_wifi_quirk(struct domain *domain, uint8_t bus, u= int8_t devfn, case 0x42388086: /* Puma Peak */ case 0x422b8086: case 0x422c8086: - rc =3D map_me_phantom_function(domain, 22, domid, pgd_madd= r, mode); + rc =3D map_me_phantom_function(domain, 22, mode, ctx); break; default: break; diff --git a/xen/drivers/passthrough/x86/Makefile b/xen/drivers/passthrough= /x86/Makefile index 75b2885336..1614f3d284 100644 --- a/xen/drivers/passthrough/x86/Makefile +++ b/xen/drivers/passthrough/x86/Makefile @@ -1,2 +1,3 @@ obj-y +=3D iommu.o +obj-y +=3D arena.o obj-$(CONFIG_HVM) +=3D hvm.o diff --git a/xen/drivers/passthrough/x86/arena.c b/xen/drivers/passthrough/= x86/arena.c new file mode 100644 index 0000000000..984bc4d643 --- /dev/null +++ b/xen/drivers/passthrough/x86/arena.c @@ -0,0 +1,157 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/** + * Simple arena-based page allocator. + * + * Allocate a large block using alloc_domheam_pages and allocate single pa= ges + * using iommu_arena_allocate_page and iommu_arena_free_page functions. + * + * Concurrent {allocate/free}_page is thread-safe + * iommu_arena_teardown during {allocate/free}_page is not thread-safe. + * + * Written by Teddy Astie + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +/* Maximum of scan tries if the bit found not available */ +#define ARENA_TSL_MAX_TRIES 5 + +int iommu_arena_initialize(struct iommu_arena *arena, struct domain *d, + unsigned int order, unsigned int memflags) +{ + struct page_info *page; + + /* TODO: Maybe allocate differently ? 
*/ + page =3D alloc_domheap_pages(d, order, memflags); + + if ( !page ) + return -ENOMEM; + + arena->map =3D xzalloc_array(unsigned long, BITS_TO_LONGS(1LLU << orde= r)); + arena->order =3D order; + arena->region_start =3D page_to_mfn(page); + + _atomic_set(&arena->used_pages, 0); + bitmap_zero(arena->map, iommu_arena_size(arena)); + + printk(XENLOG_DEBUG "IOMMU: Allocated arena (%llu pages, start=3D%"PRI= _mfn")\n", + iommu_arena_size(arena), mfn_x(arena->region_start)); + return 0; +} + +int iommu_arena_teardown(struct iommu_arena *arena, bool check) +{ + BUG_ON(mfn_x(arena->region_start) =3D=3D 0); + + /* Check for allocations if check is specified */ + if ( check && (atomic_read(&arena->used_pages) > 0) ) + return -EBUSY; + + free_domheap_pages(mfn_to_page(arena->region_start), arena->order); + + arena->region_start =3D _mfn(0); + _atomic_set(&arena->used_pages, 0); + xfree(arena->map); + arena->map =3D NULL; + + return 0; +} + +struct page_info *iommu_arena_allocate_page(struct iommu_arena *arena) +{ + unsigned int index; + unsigned int tsl_tries =3D 0; + + BUG_ON(mfn_x(arena->region_start) =3D=3D 0); + + if ( atomic_read(&arena->used_pages) =3D=3D iommu_arena_size(arena) ) + /* All pages used */ + return NULL; + + do + { + index =3D find_first_zero_bit(arena->map, iommu_arena_size(arena)); + + if ( index >=3D iommu_arena_size(arena) ) + /* No more free pages */ + return NULL; + + /* + * While there shouldn't be a lot of retries in practice, this loop + * *may* run indefinetly if the found bit is never free due to bei= ng + * overwriten by another CPU core right after. Add a safeguard for + * such very rare cases. + */ + tsl_tries++; + + if ( unlikely(tsl_tries =3D=3D ARENA_TSL_MAX_TRIES) ) + { + printk(XENLOG_ERR "ARENA: Too many TSL retries !"); + return NULL; + } + + /* Make sure that the bit we found is still free */ + } while ( test_and_set_bit(index, arena->map) ); + + atomic_inc(&arena->used_pages); + + return mfn_to_page(mfn_add(arena->region_start, index)); +} + +bool iommu_arena_free_page(struct iommu_arena *arena, struct page_info *pa= ge) +{ + unsigned long index; + mfn_t frame; + + if ( !page ) + { + printk(XENLOG_WARNING "IOMMU: Trying to free NULL page"); + WARN(); + return false; + } + + frame =3D page_to_mfn(page); + + /* Check if page belongs to our arena */ + if ( (mfn_x(frame) < mfn_x(arena->region_start)) + || (mfn_x(frame) >=3D (mfn_x(arena->region_start) + iommu_arena_si= ze(arena))) ) + { + printk(XENLOG_WARNING + "IOMMU: Trying to free outside arena region [mfn=3D%"PRI_mf= n"]", + mfn_x(frame)); + WARN(); + return false; + } + + index =3D mfn_x(frame) - mfn_x(arena->region_start); + + /* Sanity check in case of underflow. */ + ASSERT(index < iommu_arena_size(arena)); + + if ( !test_and_clear_bit(index, arena->map) ) + { + /* + * Bit was free during our arena_free_page, which means that + * either this page was never allocated, or we are in a double-free + * situation. + */ + printk(XENLOG_WARNING + "IOMMU: Freeing non-allocated region (double-free?) [mfn=3D= %"PRI_mfn"]", + mfn_x(frame)); + WARN(); + return false; + } + + atomic_dec(&arena->used_pages); + + return true; +} \ No newline at end of file diff --git a/xen/drivers/passthrough/x86/iommu.c b/xen/drivers/passthrough/= x86/iommu.c index a3fa0aef7c..078ed12b0a 100644 --- a/xen/drivers/passthrough/x86/iommu.c +++ b/xen/drivers/passthrough/x86/iommu.c @@ -12,6 +12,13 @@ * this program; If not, see . 
*/ =20 +#include +#include +#include +#include +#include +#include +#include #include #include #include @@ -28,6 +35,9 @@ #include #include #include +#include +#include +#include =20 const struct iommu_init_ops *__initdata iommu_init_ops; struct iommu_ops __ro_after_init iommu_ops; @@ -183,15 +193,42 @@ void __hwdom_init arch_iommu_check_autotranslated_hwd= om(struct domain *d) panic("PVH hardware domain iommu must be set in 'strict' mode\n"); } =20 -int arch_iommu_domain_init(struct domain *d) +int arch_iommu_context_init(struct domain *d, struct iommu_context *ctx, u= 32 flags) +{ + INIT_PAGE_LIST_HEAD(&ctx->arch.pgtables.list); + spin_lock_init(&ctx->arch.pgtables.lock); + + INIT_PAGE_LIST_HEAD(&ctx->arch.free_queue); + + return 0; +} + +int arch_iommu_context_teardown(struct domain *d, struct iommu_context *ct= x, u32 flags) { + /* Cleanup all page tables */ + while ( iommu_free_pgtables(d, ctx) =3D=3D -ERESTART ) + /* nothing */; + + return 0; +} + +int arch_iommu_flush_free_queue(struct domain *d, struct iommu_context *ct= x) +{ + struct page_info *pg; struct domain_iommu *hd =3D dom_iommu(d); =20 - spin_lock_init(&hd->arch.mapping_lock); + while ( (pg =3D page_list_remove_head(&ctx->arch.free_queue)) ) + iommu_arena_free_page(&hd->arch.pt_arena, pg); + + return 0; +} + +int arch_iommu_domain_init(struct domain *d) +{ + struct domain_iommu *hd =3D dom_iommu(d); =20 - INIT_PAGE_LIST_HEAD(&hd->arch.pgtables.list); - spin_lock_init(&hd->arch.pgtables.lock); INIT_LIST_HEAD(&hd->arch.identity_maps); + iommu_arena_initialize(&hd->arch.pt_arena, NULL, iommu_hwdom_arena_ord= er, 0); =20 return 0; } @@ -203,8 +240,9 @@ void arch_iommu_domain_destroy(struct domain *d) * domain is destroyed. Note that arch_iommu_domain_destroy() is * called unconditionally, so pgtables may be uninitialized. */ - ASSERT(!dom_iommu(d)->platform_ops || - page_list_empty(&dom_iommu(d)->arch.pgtables.list)); + struct domain_iommu *hd =3D dom_iommu(d); + + ASSERT(!hd->platform_ops); } =20 struct identity_map { @@ -227,7 +265,7 @@ int iommu_identity_mapping(struct domain *d, p2m_access= _t p2ma, ASSERT(base < end); =20 /* - * No need to acquire hd->arch.mapping_lock: Both insertion and removal + * No need to acquire hd->arch.lock: Both insertion and removal * get done while holding pcidevs_lock. */ list_for_each_entry( map, &hd->arch.identity_maps, list ) @@ -356,8 +394,8 @@ static int __hwdom_init cf_check identity_map(unsigned = long s, unsigned long e, */ if ( iomem_access_permitted(d, s, s) ) { - rc =3D iommu_map(d, _dfn(s), _mfn(s), 1, perms, - &info->flush_flags); + rc =3D _iommu_map(d, _dfn(s), _mfn(s), 1, perms, + &info->flush_flags, 0); if ( rc < 0 ) break; /* Must map a frame at least, which is what we request for= . 
*/ @@ -366,8 +404,8 @@ static int __hwdom_init cf_check identity_map(unsigned = long s, unsigned long e, } s++; } - while ( (rc =3D iommu_map(d, _dfn(s), _mfn(s), e - s + 1, - perms, &info->flush_flags)) > 0 ) + while ( (rc =3D _iommu_map(d, _dfn(s), _mfn(s), e - s + 1, + perms, &info->flush_flags, 0)) > 0 ) { s +=3D rc; process_pending_softirqs(); @@ -533,7 +571,6 @@ void __hwdom_init arch_iommu_hwdom_init(struct domain *= d) =20 void arch_pci_init_pdev(struct pci_dev *pdev) { - pdev->arch.pseudo_domid =3D DOMID_INVALID; } =20 unsigned long *__init iommu_init_domid(domid_t reserve) @@ -564,8 +601,6 @@ domid_t iommu_alloc_domid(unsigned long *map) static unsigned int start; unsigned int idx =3D find_next_zero_bit(map, UINT16_MAX - DOMID_MASK, = start); =20 - ASSERT(pcidevs_locked()); - if ( idx >=3D UINT16_MAX - DOMID_MASK ) idx =3D find_first_zero_bit(map, UINT16_MAX - DOMID_MASK); if ( idx >=3D UINT16_MAX - DOMID_MASK ) @@ -591,7 +626,7 @@ void iommu_free_domid(domid_t domid, unsigned long *map) BUG(); } =20 -int iommu_free_pgtables(struct domain *d) +int iommu_free_pgtables(struct domain *d, struct iommu_context *ctx) { struct domain_iommu *hd =3D dom_iommu(d); struct page_info *pg; @@ -601,17 +636,20 @@ int iommu_free_pgtables(struct domain *d) return 0; =20 /* After this barrier, no new IOMMU mappings can be inserted. */ - spin_barrier(&hd->arch.mapping_lock); + spin_barrier(&ctx->arch.pgtables.lock); =20 /* * Pages will be moved to the free list below. So we want to * clear the root page-table to avoid any potential use after-free. */ - iommu_vcall(hd->platform_ops, clear_root_pgtable, d); + iommu_vcall(hd->platform_ops, clear_root_pgtable, d, ctx); =20 - while ( (pg =3D page_list_remove_head(&hd->arch.pgtables.list)) ) + while ( (pg =3D page_list_remove_head(&ctx->arch.pgtables.list)) ) { - free_domheap_page(pg); + if (ctx->id =3D=3D 0) + free_domheap_page(pg); + else + iommu_arena_free_page(&hd->arch.pt_arena, pg); =20 if ( !(++done & 0xff) && general_preempt_check() ) return -ERESTART; @@ -621,6 +659,7 @@ int iommu_free_pgtables(struct domain *d) } =20 struct page_info *iommu_alloc_pgtable(struct domain_iommu *hd, + struct iommu_context *ctx, uint64_t contig_mask) { unsigned int memflags =3D 0; @@ -632,7 +671,11 @@ struct page_info *iommu_alloc_pgtable(struct domain_io= mmu *hd, memflags =3D MEMF_node(hd->node); #endif =20 - pg =3D alloc_domheap_page(NULL, memflags); + if (ctx->id =3D=3D 0) + pg =3D alloc_domheap_page(NULL, memflags); + else + pg =3D iommu_arena_allocate_page(&hd->arch.pt_arena); + if ( !pg ) return NULL; =20 @@ -665,9 +708,7 @@ struct page_info *iommu_alloc_pgtable(struct domain_iom= mu *hd, =20 unmap_domain_page(p); =20 - spin_lock(&hd->arch.pgtables.lock); - page_list_add(pg, &hd->arch.pgtables.list); - spin_unlock(&hd->arch.pgtables.lock); + page_list_add(pg, &ctx->arch.pgtables.list); =20 return pg; } @@ -706,17 +747,22 @@ static void cf_check free_queued_pgtables(void *arg) } } =20 -void iommu_queue_free_pgtable(struct domain_iommu *hd, struct page_info *p= g) +void iommu_queue_free_pgtable(struct iommu_context *ctx, struct page_info = *pg) { unsigned int cpu =3D smp_processor_id(); =20 - spin_lock(&hd->arch.pgtables.lock); - page_list_del(pg, &hd->arch.pgtables.list); - spin_unlock(&hd->arch.pgtables.lock); + spin_lock(&ctx->arch.pgtables.lock); + page_list_del(pg, &ctx->arch.pgtables.list); + spin_unlock(&ctx->arch.pgtables.lock); =20 - page_list_add_tail(pg, &per_cpu(free_pgt_list, cpu)); + if ( !ctx->id ) + { + page_list_add_tail(pg, 
&per_cpu(free_pgt_list, cpu)); =20 - tasklet_schedule(&per_cpu(free_pgt_tasklet, cpu)); + tasklet_schedule(&per_cpu(free_pgt_tasklet, cpu)); + } + else + page_list_add_tail(pg, &ctx->arch.free_queue); } =20 static int cf_check cpu_callback( --=20 2.45.2 Teddy Astie | Vates XCP-ng Intern XCP-ng & Xen Orchestra - Vates solutions web: https://vates.tech From nobody Mon Nov 25 02:36:33 2024 Delivered-To: importer@patchew.org Received-SPF: pass (zohomail.com: domain of lists.xenproject.org designates 192.237.175.120 as permitted sender) client-ip=192.237.175.120; envelope-from=xen-devel-bounces@lists.xenproject.org; helo=lists.xenproject.org; Authentication-Results: mx.zohomail.com; dkim=pass header.i=teddy.astie@vates.tech; spf=pass (zohomail.com: domain of lists.xenproject.org designates 192.237.175.120 as permitted sender) smtp.mailfrom=xen-devel-bounces@lists.xenproject.org; dmarc=pass(p=quarantine dis=none) header.from=vates.tech ARC-Seal: i=1; a=rsa-sha256; t=1718281032; cv=none; d=zohomail.com; s=zohoarc; b=GNRMTmShrqEpRMNglR61zxTVH997dMzEbauqZEyazasv4qiQOs8fD3v6p3kzTGG+7SX9VyzN1t+qpOyKVRRaYxwGFi+XQwtNvVNwDb8MyruAbRURbLmpdTbXTX93MOeJKp9TP3opRSUeJMKy272KTGlT2cs0xKg04aTLz0URfrM= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=zohomail.com; s=zohoarc; t=1718281032; h=Content-Type:Content-Transfer-Encoding:Cc:Cc:Date:Date:From:From:In-Reply-To:List-Subscribe:List-Post:List-Id:List-Help:List-Unsubscribe:MIME-Version:Message-ID:References:Sender:Subject:Subject:To:To:Message-Id:Reply-To; bh=E+4kMkd4WCWgQP6h5CFZEH7O128Jlpqjeek1o3nsjgc=; b=NXw8Tv/PKDPqRG0SFkpZnOAf5EcGsqMvQB4HwCi6L56wAooH9oXG8IbVjV9MIHtHzoUOWmQmM3hMVu4toIndm36B6xfXFtds/Q5cUattP6iU41yF2sz5WAF8Je8HFcLj0YpezCJn5GSLiJHpNLB1s7HYp6UapgUQQ2RGsnufzkE= ARC-Authentication-Results: i=1; mx.zohomail.com; dkim=pass header.i=teddy.astie@vates.tech; spf=pass (zohomail.com: domain of lists.xenproject.org designates 192.237.175.120 as permitted sender) smtp.mailfrom=xen-devel-bounces@lists.xenproject.org; dmarc=pass header.from= (p=quarantine dis=none) Return-Path: Received: from lists.xenproject.org (lists.xenproject.org [192.237.175.120]) by mx.zohomail.com with SMTPS id 1718281032542185.5881738702701; Thu, 13 Jun 2024 05:17:12 -0700 (PDT) Received: from list by lists.xenproject.org with outflank-mailman.739887.1146868 (Exim 4.92) (envelope-from ) id 1sHjNs-0000lC-E6; Thu, 13 Jun 2024 12:16:52 +0000 Received: by outflank-mailman (output) from mailman id 739887.1146868; Thu, 13 Jun 2024 12:16:52 +0000 Received: from localhost ([127.0.0.1] helo=lists.xenproject.org) by lists.xenproject.org with esmtp (Exim 4.92) (envelope-from ) id 1sHjNs-0000l3-BM; Thu, 13 Jun 2024 12:16:52 +0000 Received: by outflank-mailman (input) for mailman id 739887; Thu, 13 Jun 2024 12:16:50 +0000 Received: from se1-gles-sth1-in.inumbo.com ([159.253.27.254] helo=se1-gles-sth1.inumbo.com) by lists.xenproject.org with esmtp (Exim 4.92) (envelope-from ) id 1sHjNq-0008RX-Df for xen-devel@lists.xenproject.org; Thu, 13 Jun 2024 12:16:50 +0000 Received: from mail187-11.suw11.mandrillapp.com (mail187-11.suw11.mandrillapp.com [198.2.187.11]) by se1-gles-sth1.inumbo.com (Halon) with ESMTPS id ceda6a83-297e-11ef-90a3-e314d9c70b13; Thu, 13 Jun 2024 14:16:49 +0200 (CEST) Received: from pmta09.mandrill.prod.suw01.rsglab.com (localhost [127.0.0.1]) by mail187-11.suw11.mandrillapp.com (Mailchimp) with ESMTP id 4W0LxX1s1MzLfMD8p for ; Thu, 13 Jun 2024 12:16:48 +0000 (GMT) Received: from [37.26.189.201] by mandrillapp.com id 31a1578ca9b74cb5801c7c02def44933; Thu, 13 Jun 2024 
From nobody Mon Nov 25 02:36:33 2024
From: Teddy Astie <teddy.astie@vates.tech>
Subject: [RFC XEN PATCH 5/5] xen/public: Introduce PV-IOMMU hypercall interface
To: xen-devel@lists.xenproject.org
Cc: Teddy Astie, Andrew Cooper, George Dunlap, Jan Beulich, Julien Grall,
    Stefano Stabellini, Marek Marczykowski-Górecki
Date: Thu, 13 Jun 2024 12:16:48 +0000

Introduce a new PV interface for managing the underlying IOMMU, its contexts
and its devices. This interface allows Dom0 to create new contexts and to add
IOMMU mappings expressed from the guest's point of view.

This interface doesn't allow creating mappings to other domains.

Signed-off-by: Teddy Astie <teddy.astie@vates.tech>
---
Missing in this RFC:
 * Usage of PV-IOMMU inside DomU

Differences with Malcolm Crossley's PV-IOMMU RFC [1]:

The original PV-IOMMU interface exposes the operations of the Xen IOMMU
subsystem to the guest and therefore inherits that subsystem's limitations:
all devices are bound to a single IOMMU domain, and only that domain-wide
mapping can be modified.

The main goal of the original implementation is to allow implementing vGPU by
mapping other guests into the devices' address space (which is actually shared
by all devices of the domain).

That original draft cannot work with PVH (the HAP P2M being immutable from the
IOMMU driver's point of view) and cannot be used to implement the Linux IOMMU
subsystem (due to the inability to create separate IOMMU domains).

This new proposal aims to support the Linux IOMMU subsystem from Dom0 (and
DomU in the future). It needs to allow creation and management of IOMMU
domains (named "IOMMU contexts") separate from the "default context", on a
per-domain basis. There is no foreign mapping support (yet); the emphasis is
on uses of the Linux IOMMU subsystem such as DMA protection and VFIO.

[1] https://lore.kernel.org/all/1455099035-17649-2-git-send-email-malcolm.crossley@citrix.com/
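To make the intended call flow concrete, here is a minimal, hypothetical
Dom0-side sketch of driving the interface. It assumes a
xen_hypercall_iommu_op() helper that issues __HYPERVISOR_iommu_op with a
pointer to a struct pv_iommu_op; neither that helper nor any Linux-side
plumbing is part of this series.

/*
 * Hypothetical Dom0-side sketch (not part of this series): query the
 * PV-IOMMU capabilities, allocate a fresh context and map one guest frame
 * into it.  xen_hypercall_iommu_op() stands in for whatever mechanism the
 * guest kernel uses to issue __HYPERVISOR_iommu_op; it is assumed here,
 * not defined by this patch.
 */
#include <stdint.h>
#include <string.h>
#include "pv-iommu.h" /* the public header added by this patch */

extern long xen_hypercall_iommu_op(struct pv_iommu_op *op); /* assumed */

static long pv_iommu_map_one(uint64_t gfn, uint64_t dfn, uint16_t *ctx_no)
{
    struct pv_iommu_op op;
    long rc;

    /* 1. Discover the limits advertised by Xen. */
    memset(&op, 0, sizeof(op));
    op.subop_id = IOMMUOP_query_capabilities;
    rc = xen_hypercall_iommu_op(&op);
    if ( rc )
        return rc;
    if ( op.cap.max_ctx_no == 0 )
        return -1; /* no non-default context available */

    /* 2. Allocate an empty (non-cloned) IOMMU context. */
    memset(&op, 0, sizeof(op));
    op.subop_id = IOMMUOP_alloc_context;
    rc = xen_hypercall_iommu_op(&op);
    if ( rc )
        return rc;
    *ctx_no = op.ctx_no; /* handle of the new context, filled in by Xen */

    /* 3. Map guest frame gfn at device address dfn, read/write. */
    op.subop_id = IOMMUOP_map_pages;   /* op.ctx_no kept from the alloc */
    op.flags = IOMMU_OP_readable | IOMMU_OP_writeable;
    op.map_pages.gfn = gfn;
    op.map_pages.dfn = dfn;
    op.map_pages.nr_pages = 1;
    rc = xen_hypercall_iommu_op(&op);

    /* On failure, op.map_pages.mapped reports how many pages were mapped. */
    return rc;
}

Since the map loop in Xen stops at the first failing page, a caller can use
op.map_pages.mapped to resume or roll back a partially completed request.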
---
 xen/common/Makefile           |   1 +
 xen/common/pv-iommu.c         | 320 ++++++++++++++++++++++++++++++++++
 xen/include/hypercall-defs.c  |   6 +
 xen/include/public/pv-iommu.h | 114 ++++++++++++
 xen/include/public/xen.h      |   1 +
 5 files changed, 442 insertions(+)
 create mode 100644 xen/common/pv-iommu.c
 create mode 100644 xen/include/public/pv-iommu.h

diff --git a/xen/common/Makefile b/xen/common/Makefile
index d512cad524..336c5ea143 100644
--- a/xen/common/Makefile
+++ b/xen/common/Makefile
@@ -57,6 +57,7 @@ obj-y += wait.o
 obj-bin-y += warning.init.o
 obj-$(CONFIG_XENOPROF) += xenoprof.o
 obj-y += xmalloc_tlsf.o
+obj-y += pv-iommu.o
 
 obj-bin-$(CONFIG_X86) += $(foreach n,decompress bunzip2 unxz unlzma lzo unlzo unlz4 unzstd earlycpio,$(n).init.o)
 
diff --git a/xen/common/pv-iommu.c b/xen/common/pv-iommu.c
new file mode 100644
index 0000000000..844642ee54
--- /dev/null
+++ b/xen/common/pv-iommu.c
@@ -0,0 +1,320 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * xen/common/pv_iommu.c
+ *
+ * PV-IOMMU hypercall interface.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#define PVIOMMU_PREFIX "[PV-IOMMU] "
+
+#define PVIOMMU_MAX_PAGES 256 /* Move to Kconfig ? */
+
+/* Allowed masks for each sub-operation */
+#define ALLOC_OP_FLAGS_MASK (0)
+#define FREE_OP_FLAGS_MASK (IOMMU_TEARDOWN_REATTACH_DEFAULT)
+
+static int get_paged_frame(struct domain *d, gfn_t gfn, mfn_t *mfn,
+                           struct page_info **page, int readonly)
+{
+    p2m_type_t p2mt;
+
+    *page = get_page_from_gfn(d, gfn_x(gfn), &p2mt,
+                              (readonly) ? P2M_ALLOC : P2M_UNSHARE);
+
+    if ( !(*page) )
+    {
+        *mfn = INVALID_MFN;
+        if ( p2m_is_shared(p2mt) )
+            return -EINVAL;
+        if ( p2m_is_paging(p2mt) )
+        {
+            p2m_mem_paging_populate(d, gfn);
+            return -EIO;
+        }
+
+        return -EPERM;
+    }
+
+    *mfn = page_to_mfn(*page);
+
+    return 0;
+}
+
+static int can_use_iommu_check(struct domain *d)
+{
+    if ( !iommu_enabled )
+    {
+        printk(PVIOMMU_PREFIX "IOMMU is not enabled\n");
+        return 0;
+    }
+
+    if ( !is_hardware_domain(d) )
+    {
+        printk(PVIOMMU_PREFIX "Non-hardware domain\n");
+        return 0;
+    }
+
+    if ( !is_iommu_enabled(d) )
+    {
+        printk(PVIOMMU_PREFIX "IOMMU disabled for this domain\n");
+        return 0;
+    }
+
+    return 1;
+}
+
+static long query_cap_op(struct pv_iommu_op *op, struct domain *d)
+{
+    op->cap.max_ctx_no = d->iommu.other_contexts.count;
+    op->cap.max_nr_pages = PVIOMMU_MAX_PAGES;
+    op->cap.max_iova_addr = (1LLU << 39) - 1; /* TODO: hardcoded 39-bits */
+
+    return 0;
+}
+
+static long alloc_context_op(struct pv_iommu_op *op, struct domain *d)
+{
+    u16 ctx_no = 0;
+    int status = 0;
+
+    status = iommu_context_alloc(d, &ctx_no, op->flags & ALLOC_OP_FLAGS_MASK);
+
+    if (status < 0)
+        return status;
+
+    printk("Created context %hu\n", ctx_no);
+
+    op->ctx_no = ctx_no;
+    return 0;
+}
+
+static long free_context_op(struct pv_iommu_op *op, struct domain *d)
+{
+    return iommu_context_free(d, op->ctx_no,
+                              IOMMU_TEARDOWN_PREEMPT | (op->flags & FREE_OP_FLAGS_MASK));
+}
+
+static long reattach_device_op(struct pv_iommu_op *op, struct domain *d)
+{
+    struct physdev_pci_device dev = op->reattach_device.dev;
+    device_t *pdev;
+
+    pdev = pci_get_pdev(d, PCI_SBDF(dev.seg, dev.bus, dev.devfn));
+
+    if ( !pdev )
+        return -ENOENT;
+
+    return iommu_reattach_context(d, d, pdev, op->ctx_no);
+}
+
+static long map_pages_op(struct pv_iommu_op *op, struct domain *d)
+{
+    int ret = 0, flush_ret;
+    struct page_info *page = NULL;
+    mfn_t mfn;
+    unsigned int flags;
+    unsigned int flush_flags = 0;
+    size_t i = 0;
+
+    if ( op->map_pages.nr_pages > PVIOMMU_MAX_PAGES )
+        return -E2BIG;
+
+    if ( !iommu_check_context(d, op->ctx_no) )
+        return -EINVAL;
+
+    //printk("Mapping gfn:%lx-%lx to dfn:%lx-%lx on %hu\n",
+    //       op->map_pages.gfn, op->map_pages.gfn + op->map_pages.nr_pages - 1,
+    //       op->map_pages.dfn, op->map_pages.dfn + op->map_pages.nr_pages - 1,
+    //       op->ctx_no);
+
+    flags = 0;
+
+    if ( op->flags & IOMMU_OP_readable )
+        flags |= IOMMUF_readable;
+
+    if ( op->flags & IOMMU_OP_writeable )
+        flags |= IOMMUF_writable;
+
+    for (i = 0; i < op->map_pages.nr_pages; i++)
+    {
+        gfn_t gfn = _gfn(op->map_pages.gfn + i);
+        dfn_t dfn = _dfn(op->map_pages.dfn + i);
+
+        /* Lookup pages struct backing gfn */
+        ret = get_paged_frame(d, gfn, &mfn, &page, 0);
+
+        if ( ret )
+            break;
+
+        /* Check for conflict with existing mappings */
+        if ( !iommu_lookup_page(d, dfn, &mfn, &flags, op->ctx_no) )
+        {
+            put_page(page);
+            ret = -EADDRINUSE;
+            break;
+        }
+
+        ret = iommu_map(d, dfn, mfn, 1, flags, &flush_flags, op->ctx_no);
+
+        if ( ret )
+            break;
+    }
+
+    op->map_pages.mapped = i;
+
+    flush_ret = iommu_iotlb_flush(d, _dfn(op->map_pages.dfn),
+                                  op->map_pages.nr_pages, flush_flags,
+                                  op->ctx_no);
+
+    if ( flush_ret )
+        printk("Flush operation failed (%d)\n", flush_ret);
+
+    return ret;
+}
+
+static long unmap_pages_op(struct pv_iommu_op *op, struct domain *d)
+{
+    mfn_t mfn;
+    int ret = 0, flush_ret;
+    unsigned int flags;
+    unsigned int flush_flags = 0;
+    size_t i = 0;
+
+    if ( op->unmap_pages.nr_pages > PVIOMMU_MAX_PAGES )
+        return -E2BIG;
+
+    if ( !iommu_check_context(d, op->ctx_no) )
+        return -EINVAL;
+
+    //printk("Unmapping dfn:%lx-%lx on %hu\n",
+    //       op->unmap_pages.dfn, op->unmap_pages.dfn + op->unmap_pages.nr_pages - 1,
+    //       op->ctx_no);
+
+    for (i = 0; i < op->unmap_pages.nr_pages; i++)
+    {
+        dfn_t dfn = _dfn(op->unmap_pages.dfn + i);
+
+        /* Check if there is a valid mapping for this domain */
+        if ( iommu_lookup_page(d, dfn, &mfn, &flags, op->ctx_no) ) {
+            ret = -ENOENT;
+            break;
+        }
+
+        ret = iommu_unmap(d, dfn, 1, 0, &flush_flags, op->ctx_no);
+
+        if (ret)
+            break;
+
+        /* Decrement reference counter */
+        put_page(mfn_to_page(mfn));
+    }
+
+    op->unmap_pages.unmapped = i;
+
+    flush_ret = iommu_iotlb_flush(d, _dfn(op->unmap_pages.dfn),
+                                  op->unmap_pages.nr_pages, flush_flags,
+                                  op->ctx_no);
+
+    if ( flush_ret )
+        printk("Flush operation failed (%d)\n", flush_ret);
+
+    return ret;
+}
+
+static long lookup_page_op(struct pv_iommu_op *op, struct domain *d)
+{
+    mfn_t mfn;
+    gfn_t gfn;
+    unsigned int flags = 0;
+
+    if ( !iommu_check_context(d, op->ctx_no) )
+        return -EINVAL;
+
+    /* Check if there is a valid BFN mapping for this domain */
+    if ( iommu_lookup_page(d, _dfn(op->lookup_page.dfn), &mfn, &flags, op->ctx_no) )
+        return -ENOENT;
+
+    gfn = mfn_to_gfn(d, mfn);
+    BUG_ON(gfn_eq(gfn, INVALID_GFN));
+
+    op->lookup_page.gfn = gfn_x(gfn);
+
+    return 0;
+}
+
+long do_iommu_sub_op(struct pv_iommu_op *op)
+{
+    struct domain *d = current->domain;
+
+    if ( !can_use_iommu_check(d) )
+        return -EPERM;
+
+    switch ( op->subop_id )
+    {
+    case 0:
+        return 0;
+
+    case IOMMUOP_query_capabilities:
+        return query_cap_op(op, d);
+
+    case IOMMUOP_alloc_context:
+        return alloc_context_op(op, d);
+
+    case IOMMUOP_free_context:
+        return free_context_op(op, d);
+
+    case IOMMUOP_reattach_device:
+        return reattach_device_op(op, d);
+
+    case IOMMUOP_map_pages:
+        return map_pages_op(op, d);
+
+    case IOMMUOP_unmap_pages:
+        return unmap_pages_op(op, d);
+
+    case IOMMUOP_lookup_page:
+        return lookup_page_op(op, d);
+
+    default:
+        return -EINVAL;
+    }
+}
+
+long do_iommu_op(XEN_GUEST_HANDLE_PARAM(void) arg)
+{
+    long ret = 0;
+    struct pv_iommu_op op;
+
+    if ( unlikely(copy_from_guest(&op, arg, 1)) )
+        return -EFAULT;
+
+    ret = do_iommu_sub_op(&op);
+
+    if ( ret == -ERESTART )
+        return hypercall_create_continuation(__HYPERVISOR_iommu_op, "h", arg);
+
+    if ( unlikely(copy_to_guest(arg, &op, 1)) )
+        return -EFAULT;
+
+    return ret;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/include/hypercall-defs.c b/xen/include/hypercall-defs.c
index 47c093acc8..84db1ab3c3 100644
--- a/xen/include/hypercall-defs.c
+++ b/xen/include/hypercall-defs.c
@@ -209,6 +209,9 @@ hypfs_op(unsigned int cmd, const char *arg1, unsigned long arg2, void *arg3, uns
 #ifdef CONFIG_X86
 xenpmu_op(unsigned int op, xen_pmu_params_t *arg)
 #endif
+#ifdef CONFIG_HAS_PASSTHROUGH
+iommu_op(void *arg)
+#endif
 
 #ifdef CONFIG_PV
 caller: pv64
@@ -295,5 +298,8 @@ mca do do - - -
 #ifndef CONFIG_PV_SHIM_EXCLUSIVE
 paging_domctl_cont do do do do -
 #endif
+#ifdef CONFIG_HAS_PASSTHROUGH
+iommu_op do do do do -
+#endif
 
 #endif /* !CPPCHECK */
diff --git a/xen/include/public/pv-iommu.h b/xen/include/public/pv-iommu.h
new file mode 100644
index 0000000000..45f9c44eb1
--- /dev/null
+++ b/xen/include/public/pv-iommu.h
@@ -0,0 +1,114 @@
+/* SPDX-License-Identifier: MIT */
+/******************************************************************************
+ * pv-iommu.h
+ *
+ * Paravirtualized IOMMU driver interface.
+ *
+ * Copyright (c) 2024 Teddy Astie
+ */
+
+#ifndef __XEN_PUBLIC_PV_IOMMU_H__
+#define __XEN_PUBLIC_PV_IOMMU_H__
+
+#include "xen.h"
+#include "physdev.h"
+
+#define IOMMU_DEFAULT_CONTEXT (0)
+
+/**
+ * Query PV-IOMMU capabilities for this domain.
+ */
+#define IOMMUOP_query_capabilities 1
+
+/**
+ * Allocate an IOMMU context; the new context handle is written to ctx_no.
+ */
+#define IOMMUOP_alloc_context 2
+
+/**
+ * Destroy an IOMMU context.
+ * All devices attached to this context are reattached to the default context.
+ *
+ * The default context (0) can't be destroyed.
+ */
+#define IOMMUOP_free_context 3
+
+/**
+ * Reattach the device to an IOMMU context.
+ */
+#define IOMMUOP_reattach_device 4
+
+#define IOMMUOP_map_pages 5
+#define IOMMUOP_unmap_pages 6
+
+/**
+ * Get the GFN associated with a specific DFN.
+ */
+#define IOMMUOP_lookup_page 7
+
+struct pv_iommu_op {
+    uint16_t subop_id;
+    uint16_t ctx_no;
+
+/**
+ * Create a context that is cloned from the default one.
+ * The new context will be populated with 1:1 mappings covering the entire
+ * guest memory.
+ */
+#define IOMMU_CREATE_clone (1 << 0)
+
+#define IOMMU_OP_readable (1 << 0)
+#define IOMMU_OP_writeable (1 << 1)
+    uint32_t flags;
+
+    union {
+        struct {
+            uint64_t gfn;
+            uint64_t dfn;
+            /* Number of pages to map */
+            uint32_t nr_pages;
+            /* Number of pages actually mapped after sub-op */
+            uint32_t mapped;
+        } map_pages;
+
+        struct {
+            uint64_t dfn;
+            /* Number of pages to unmap */
+            uint32_t nr_pages;
+            /* Number of pages actually unmapped after sub-op */
+            uint32_t unmapped;
+        } unmap_pages;
+
+        struct {
+            struct physdev_pci_device dev;
+        } reattach_device;
+
+        struct {
+            uint64_t gfn;
+            uint64_t dfn;
+        } lookup_page;
+
+        struct {
+            /* Maximum number of IOMMU contexts this domain can use. */
+            uint16_t max_ctx_no;
+            /* Maximum number of pages that can be modified in a single map/unmap operation. */
+            uint32_t max_nr_pages;
+            /* Maximum device address (iova) that the guest can use for mappings. */
+            uint64_t max_iova_addr;
+        } cap;
+    };
+};
+
+typedef struct pv_iommu_op pv_iommu_op_t;
+DEFINE_XEN_GUEST_HANDLE(pv_iommu_op_t);
+
+#endif
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
\ No newline at end of file
diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
index b47d48d0e2..28ab815ebc 100644
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -118,6 +118,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_ulong_t);
 #define __HYPERVISOR_xenpmu_op            40
 #define __HYPERVISOR_dm_op                41
 #define __HYPERVISOR_hypfs_op             42
+#define __HYPERVISOR_iommu_op             43
 
 /* Architecture-specific hypercall definitions. */
 #define __HYPERVISOR_arch_0               48
-- 
2.45.2

Teddy Astie | Vates XCP-ng Intern

XCP-ng & Xen Orchestra - Vates solutions

web: https://vates.tech
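For completeness, a hypothetical teardown counterpart to the sketch in the
patch notes above, under the same assumption of a xen_hypercall_iommu_op()
helper (which this series does not provide):

/*
 * Hypothetical teardown counterpart (not part of this series): remove the
 * mapping again and release the context.  xen_hypercall_iommu_op() is the
 * same assumed guest-side helper as in the earlier sketch.
 */
static long pv_iommu_unmap_and_free(uint16_t ctx_no, uint64_t dfn)
{
    struct pv_iommu_op op;
    long rc;

    /* Remove the single mapping previously installed at dfn. */
    memset(&op, 0, sizeof(op));
    op.subop_id = IOMMUOP_unmap_pages;
    op.ctx_no = ctx_no;
    op.unmap_pages.dfn = dfn;
    op.unmap_pages.nr_pages = 1;
    rc = xen_hypercall_iommu_op(&op);
    if ( rc )
        return rc; /* op.unmap_pages.unmapped reports partial progress */

    /*
     * Destroy the context.  As documented in pv-iommu.h, devices still
     * attached to it are reattached to the default context (0).
     */
    memset(&op, 0, sizeof(op));
    op.subop_id = IOMMUOP_free_context;
    op.ctx_no = ctx_no;
    return xen_hypercall_iommu_op(&op);
}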