Xen Security Advisory 400 v2 (CVE-2022-26358,CVE-2022-26359,CVE-2022-26360,CVE-2022-26361) - IOMMU: RMRR (VT-d) and unity map (AMD-Vi) handling issues

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

 Xen Security Advisory CVE-2022-26358,CVE-2022-26359,CVE-2022-26360,CVE-2022-26361 / XSA-400
                                          version 2

       IOMMU: RMRR (VT-d) and unity map (AMD-Vi) handling issues

UPDATES IN VERSION 2
====================

Public release.

ISSUE DESCRIPTION
=================

Certain PCI devices in a system might be assigned Reserved Memory
Regions (specified via Reserved Memory Region Reporting, "RMRR") for
Intel VT-d or Unity Mapping ranges for AMD-Vi.  These are typically used
for platform tasks such as legacy USB emulation.

Since the precise purpose of these regions is unknown, once a device
associated with such a region is active, the mappings of these regions
need to remain continuously accessible by the device.  This requirement
has been violated.

Subsequent DMA or interrupts from the device may have unpredictable
behaviour, ranging from IOMMU faults to memory corruption.

IMPACT
======

The precise impact is system specific, but would likely be a Denial of
Service (DoS) affecting the entire host.  Privilege escalation and
information leaks cannot be ruled out.

VULNERABLE SYSTEMS
==================

All Xen versions supporting PCI passthrough are affected.

Only x86 systems with IOMMU hardware are vulnerable.  Arm systems
as well as x86 systems without IOMMU hardware or without any IOMMUs in
use are not vulnerable.

Only x86 guests which have physical devices passed through to them,
and only when any such device has an associated RMRR or unity map, can
leverage the vulnerability. (Whether a device is associated with an RMRR
or unity map is not easy to discern.)
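
As a purely illustrative aid (not part of the advisory), the sketch
below lists the VT-d RMRR regions and the first PCI device named in
each region's device scope, by parsing the raw ACPI DMAR table which a
Linux dom0 exposes under /sys/firmware/acpi/tables/DMAR (root access
required).  The layout follows the VT-d specification; devices behind
bridges (longer scope paths) are ignored, and AMD-Vi unity maps, which
are described by IVMD entries in the ACPI IVRS table instead, are not
covered.

/*
 * Hypothetical standalone helper, NOT part of the advisory or of Xen:
 * print each RMRR region together with the first device-scope entry it
 * names.  Layout per the VT-d specification.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define ACPI_DMAR_TYPE_RMRR 1

int main(void)
{
    static uint8_t buf[1 << 16];
    FILE *f = fopen("/sys/firmware/acpi/tables/DMAR", "rb");
    size_t len, off = 48; /* 36-byte ACPI header + 12-byte DMAR preamble */

    if ( !f )
    {
        perror("DMAR");
        return 1;
    }
    len = fread(buf, 1, sizeof(buf), f);
    fclose(f);

    while ( off + 4 <= len )
    {
        unsigned int type = buf[off] | (buf[off + 1] << 8);
        unsigned int size = buf[off + 2] | (buf[off + 3] << 8);

        if ( size < 4 || off + size > len )
            break;

        if ( type == ACPI_DMAR_TYPE_RMRR && size >= 24 )
        {
            unsigned int seg = buf[off + 6] | (buf[off + 7] << 8);
            uint64_t base = 0, limit = 0;
            unsigned int i, scope = 24;

            for ( i = 0; i < 8; i++ )
            {
                base  |= (uint64_t)buf[off +  8 + i] << (8 * i);
                limit |= (uint64_t)buf[off + 16 + i] << (8 * i);
            }
            printf("RMRR [%" PRIx64 ",%" PRIx64 "]\n", base, limit);

            /* Scope entry: type(1) len(1) rsvd(2) enum-id(1) bus(1) {dev,fn}... */
            while ( scope + 8 <= size )
            {
                unsigned int slen = buf[off + scope + 1];

                if ( slen >= 8 ) /* only the first path entry is decoded */
                    printf("  scope %04x:%02x:%02x.%u\n", seg,
                           buf[off + scope + 5], buf[off + scope + 6],
                           buf[off + scope + 7]);
                if ( slen < 6 )
                    break;
                scope += slen;
            }
        }
        off += size;
    }

    return 0;
}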

MITIGATION
==========

Not passing through physical devices to untrusted guests when the
devices have associated RMRRs / unity maps will avoid the vulnerability.

CREDITS
=======

Aspects of this issue were discovered by Jan Beulich of SUSE and
Roger Pau Monné of Citrix.

RESOLUTION
==========

Applying the appropriate set of attached patches resolves this issue.

Note that patches for released versions are generally prepared to
apply to the stable branches, and may not apply cleanly to the most
recent release tarball.  Downstreams are encouraged to update to the
tip of the stable branch before applying these patches.

xsa400/xsa400-??.patch          xen-unstable
xsa400/xsa400-4.16-*.patch      Xen 4.16.x
xsa400/xsa400-4.15-*.patch      Xen 4.15.x
xsa400/xsa400-4.14-*.patch      Xen 4.14.x
xsa400/xsa400-4.13-*.patch      Xen 4.13.x
xsa400/xsa400-4.12-*.patch      Xen 4.12.x
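
For illustration only (the clone path, branch and directory names below
are examples, not taken from the advisory), a downstream tracking
Xen 4.16 might update to the branch tip and apply the series roughly as
follows:

$ git clone -b stable-4.16 https://xenbits.xen.org/git-http/xen.git xen-stable-4.16
$ cd xen-stable-4.16
$ git am ../xsa400/xsa400-4.16-*.patch    # or apply each file with: patch -p1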

$ sha256sum xsa400*/*
108544235a011e96e8717e9e744190fd8128e99cca8141f682667bd7b8533f45  xsa400/xsa400-01.patch
1648ef0213dd6beeaa782a18926416a00aa51ad89136aa1565debd5312bf58aa  xsa400/xsa400-02.patch
39e02738ce3d3a65c02355dd45cb844ecca8be9715b7887dedcdaada02c4bda1  xsa400/xsa400-03.patch
2d5e7f4090418d817df002b7cd7a9a40246ff9900faf53fc2b2f685ac77b2a0e  xsa400/xsa400-4.12-00.patch
a4291033d1960f8095a11e765072f60cf9e5be07d4cd0cbfed1896f9c66a4576  xsa400/xsa400-4.12-01.patch
2f5b2ce22ae6cba8a5380ccf63ae2dcf6e7ce8f29a33c87c0630abc6c1d24793  xsa400/xsa400-4.12-02.patch
86ee30d46957f63e4516ea47f14470579d0906fd536b39e31645e2cd471c1b49  xsa400/xsa400-4.12-03.patch
546430f61df3eb92a55e32047cce8557d85a8e860f67d558c35fea44e7c56b18  xsa400/xsa400-4.12-04.patch
178463bf81b138668c2f036b11a0b9406a6f5bb98c0643741787ccdf276495d9  xsa400/xsa400-4.12-05.patch
b19c599eb5eb17f8e0e17879174a4235fdc3f2172bc77bffdee60074ef2d7c90  xsa400/xsa400-4.12-06.patch
02eae481f2b06763dd913b46a65837791147b7916bbd0ed50e5d972784b310aa  xsa400/xsa400-4.12-07.patch
298339cc343d9803158abd0c36030f36e7dc8360ccfed320f34ff01813c47434  xsa400/xsa400-4.12-08.patch
9253d7120db39a1e3c14fc6b8735577f3d215837996eed057511146c0ba199d6  xsa400/xsa400-4.12-09.patch
762bbe418c028449812b4441c4e369230477eeea4d863c05a2efd9c7e5e19bc3  xsa400/xsa400-4.12-10.patch
7172957a27cdf02327c28b5cfedf35b12e76cb17237cbb9c318442961f503fe6  xsa400/xsa400-4.12-11.patch
c685d9ee2453728f509a78506930f756118ff443fbe6d5307873e90455b5e131  xsa400/xsa400-4.13-00.patch
0b0a80a249c9c58726f913df24163976cfaadabc685d3d274dd7b972fe4cbffc  xsa400/xsa400-4.13-01.patch
eabee7d9cabe00748e78fda0363818ca6085ac395804658a11f94394236d2734  xsa400/xsa400-4.13-02.patch
66cd2a0ee8a002668e04e2bb7192ac2f774012434765988bdea0a7e79621f58c  xsa400/xsa400-4.13-03.patch
4457c142c1fdc67e0dc77a57bc57159ee63dc1946e432ae1225937c7ed3a3e82  xsa400/xsa400-4.13-04.patch
2feabcbcf160d8c1130b2c86efd98dd9fee0332489520f3a10aa8bb37d6a5a20  xsa400/xsa400-4.13-05.patch
903099123d37806b7fe684d75ab0fb6a28b4609d7370cce65a9ba5ab4db43ddd  xsa400/xsa400-4.13-06.patch
e8a39d36f629b7f79634d09d22ace97f1c23ee10ee9fef6afd68ec2204c55d81  xsa400/xsa400-4.13-07.patch
d85c63917542d2d2b47dd6322b4c32364719be4707acf616290aacd536c57805  xsa400/xsa400-4.13-08.patch
5496ca04a2789603069ca8e8d3bf6d301c19c799ccff813bf17865b46f687757  xsa400/xsa400-4.13-09.patch
8b343f3616c0283caac334b4f0cc42b1523f2dc9de3463f2edadbf1a11c17a22  xsa400/xsa400-4.13-10.patch
577d20d2ed5da8a89f32eae717a140f9ec3900e31fc9e10ee23e37ae5dc19d4a  xsa400/xsa400-4.13-11.patch
bfa82c6beae9ec8ab88e042296442eed3ca162b42124c982785c7c0d95440480  xsa400/xsa400-4.14-01.patch
f8f7da8d0cc3d149874956e9aefe666a2e33275dba66ed0d55f0b559d078f01f  xsa400/xsa400-4.14-02.patch
49acd3d795aa091f8fa1c72b5064b5a71966e77c9785f5d0f8226d99daba3ead  xsa400/xsa400-4.14-03.patch
0ce4435a8c7fd6f4186ea31d29932094b1a902d9d0c5dafa3fcb1c15c5eca88c  xsa400/xsa400-4.14-04.patch
aeae4cef4bf31ecd854ad820239fe793f36ed81c0ad28ce3935ab3f83c3ee58c  xsa400/xsa400-4.14-05.patch
f8d93e2ab3f891f70abd17ddba9aae8237605f1f680754ff9df646e5ddb9d419  xsa400/xsa400-4.14-06.patch
afb9f731ba7a53930626e322ebb39db19f817b5c60710fc934c395fcdea4c7f4  xsa400/xsa400-4.14-07.patch
c236bb7b5c692cf30e7b84e6668f87b248bcf80c6df1dd74c655b91b955eb271  xsa400/xsa400-4.14-08.patch
6c4d44983eae92212be6bda7663198260bdeb0506c5b622c38ddbcedf9360d69  xsa400/xsa400-4.14-09.patch
5393c6681c675c666396f6725c2ac2e48293465e97bc228dcfc410ee84d8ba7c  xsa400/xsa400-4.14-10.patch
8d1dcdeb4d9420840a6753bb56f35bf199af63a6c477f2352e47222df5337c1a  xsa400/xsa400-4.14-11.patch
f52a70fd92c4819658630c9d478ae03a9801352b024240059877d0ee2002e31e  xsa400/xsa400-4.15-01.patch
243a0b20b20c78b41f833004c82ce26b1249285671f150bdfe4a2314860df316  xsa400/xsa400-4.15-02.patch
175f2472f80dacffcee578543f4fc4521f5533de46199ce86b072bfd0ede9ae3  xsa400/xsa400-4.15-03.patch
dee6fbe9e3c03d695bae1f81cebca80ca54ecf02a51db64ae5f3d313837eed8c  xsa400/xsa400-4.15-04.patch
55a35033b8ba45b1c9c556e0c2866733e518137299f2f3f4d41046766898aeb5  xsa400/xsa400-4.15-05.patch
94ec5289a3a632fed8d220478847ef3b780d7db345f30ee9d6b186905de61048  xsa400/xsa400-4.15-06.patch
50cbb08b931cd1cc8ec3e2aa17b537db80dcd03a6de0994331fc3818f53cbfe4  xsa400/xsa400-4.15-07.patch
e4c77fac42b8b0b50cdc9de30f406176c44de75d647e36b4bbd6360dd70c8aa6  xsa400/xsa400-4.15-08.patch
652cd8700b830b3520db325cfd90eab6c08bd423debb21c05caa7edfef9ee671  xsa400/xsa400-4.15-09.patch
50e58d663690475ce35a0a65eda88a2c2da319ddc02eb15b5a7d568a8f0a0366  xsa400/xsa400-4.15-10.patch
37a18da6ad1f529bfdc4156225f18c17c10a45302b7d6045ee38c934656589da  xsa400/xsa400-4.15-11.patch
ff069d123df2aba7a4f3185f21f7ca36b34fc026dcdd279224c86698f84c0975  xsa400/xsa400-4.15-12.patch
76497919563cdf2804f5071325c032acd04cf8df75c0dcb4b207a93b9ae80927  xsa400/xsa400-4.16-01.patch
150f2a7621ec17d6369a6ebbc2c08c502f3524acb89855a86a25d7b4fa3e3270  xsa400/xsa400-4.16-02.patch
cd59bca0fcab4bda1d0ea839501543a59b53aef61e96a0d949675bc5550a6fcb  xsa400/xsa400-4.16-03.patch
098edab3ed8915a3598badaa1d452f7c8ab2d8e72879dd9bf941b2093e6df9d1  xsa400/xsa400-4.16-04.patch
a85640291e1bc1cdb757172eb6d2036834ad2eb7b84252cc64d29db3feeff331  xsa400/xsa400-4.16-05.patch
99d43caf1ce60f421d940c2774e2a59d65d1f0dcccd941f13066117a19222b22  xsa400/xsa400-4.16-06.patch
fbe5ec58da594dffd3e63c18406280f27d976609350d7c7083acbf2f2e6538cd  xsa400/xsa400-4.16-07.patch
e6b0d400beda8ff0e9bdbfd033bc23935069b41dbd5dddf863dbd70de44d908f  xsa400/xsa400-4.16-08.patch
965f96bdb33a872942de42597c7cc32012020f37b25a84015ecc55387d6b07dd  xsa400/xsa400-4.16-09.patch
7996b1462374168ddc9cbc01c990d5cb40140d7a100096f284f6b74fb4ad7ec4  xsa400/xsa400-4.16-10.patch
23e325be799b299e9621e76eeff646b81518a4124474fb766d3bf6f0cc925083  xsa400/xsa400-4.16-11.patch
6fa43e1a8f53184724cd4a7b5c13f0fd699c998af968bafaefda570432b5a7d8  xsa400/xsa400-4.16-12.patch
82306d680ef445bee04969028184f65b5e106c89c308fbb876b858f77fea9506  xsa400/xsa400-04.patch
ccf3ff62b427d3d2ea46d4da96beeef2cb64674bc0d247352233c4b84a21f205  xsa400/xsa400-05.patch
8b3f767ca659e8bbf4983927999bdb92d9fa42e3a88973e22facda0e23f29a84  xsa400/xsa400-06.patch
bb570b89a2d4b535831d9211bf08cc0c62c060dc7808911afd8186082b884cf8  xsa400/xsa400-07.patch
8382fd5336b5b4b3388dee099da00b1e728dea77d96825c6088991d7e50f333b  xsa400/xsa400-08.patch
6909363863932ca77c9b7384516965bc1697fab9b4814294a1675fb6ce8e166e  xsa400/xsa400-09.patch
c37fee4dd9ab2bf64ebb8b9c553f793bfcef9a7cf1972b0000fe6ce28b9e8e60  xsa400/xsa400-10.patch
70f11f64457c9703f09c3121d08d5ff4676af20bd42eb1262c433a5e0f79ea41  xsa400/xsa400-11.patch
724e34d262939162ecab713c070cc07b8f7baeca50ee8b62bb59460ec4f7fbdb  xsa400/xsa400-12.patch
$

DEPLOYMENT DURING EMBARGO
=========================

Deployment of the patches described above (or others which are
substantially similar) is permitted during the embargo, even on
public-facing systems with untrusted guest users and administrators.

HOWEVER, deployment of the mitigation is NOT permitted (except where
all the affected systems and VMs are administered and used only by
organisations which are members of the Xen Project Security Issues
Predisclosure List).  Specifically, deployment on public cloud systems
is NOT permitted.

This is because removal of pass-through devices or their replacement by
emulated devices is a guest visible configuration change, which may lead
to re-discovery of the issue.

Deployment of this mitigation is permitted only AFTER the embargo ends.

AND: Distribution of updated software is prohibited (except to other
members of the predisclosure list).

(Note: this during-embargo deployment notice is retained in
post-embargo publicly released Xen Project advisories, even though it
is then no longer applicable.  This is to enable the community to have
oversight of the Xen Project Security Team's decisionmaking.)

For more information about permissible uses of embargoed information,
consult the Xen Project community's agreed Security Policy:
  http://www.xenproject.org/security-policy.html
-----BEGIN PGP SIGNATURE-----

iQFABAEBCAAqFiEEI+MiLBRfRHX6gGCng/4UyVfoK9kFAmJML+0MHHBncEB4ZW4u
b3JnAAoJEIP+FMlX6CvZQl8IALuh2RTTSiQBYeybKZPr3QEOfy+L3VFzNbRZeGBc
jpN12lXjzkFvVuTXDu6Cgw1g6yhRrcRnhGWhx3T8RpGeYhHq1GeKSw7+c4NTsiDL
21P9F3mEban6tOUa82V2PTkYfAMMpbn5apOcsMvdlPoEgXdgLPh5HoVIvXQQO6Ni
3MkCN7foV3M07jAiB4ehqrsPLhzGSCCRJfiD9PA1/RnnzCBrspyyWepF7mwzh2cx
1kENmG8mSwA3Eg0fNUMDBi6W+drNmIx4gooYo3LBnCyMgzUrv7+bzkvjpcU8+dAq
H5i2Morbx4j0F/TcQ8+23KjzNvJHbiqKVrIFCoa1z54dGC8=
=tz+l
-----END PGP SIGNATURE-----
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: fix (de)assign ordering when RMRRs are in use

In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully updated.

Also adjust a related log message.

This is CVE-2022-26358 / part of XSA-400.

Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
v2: Don't unmap RMRRs for Dom0.

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2414,6 +2414,10 @@ static int cf_check reassign_device_owne
 {
     int ret;
 
+    ret = domain_context_unmap(source, devfn, pdev);
+    if ( ret )
+        return ret;
+
     /*
      * If the device belongs to the hardware domain, and it has RMRR, don't
      * remove it from the hardware domain, because BIOS may use RMRR at
@@ -2442,10 +2446,6 @@ static int cf_check reassign_device_owne
             }
     }
 
-    ret = domain_context_unmap(source, devfn, pdev);
-    if ( ret )
-        return ret;
-
     if ( devfn == pdev->devfn && pdev->domain != dom_io )
     {
         list_move(&pdev->domain_list, &dom_io->pdev_list);
@@ -2530,9 +2530,8 @@ static int cf_check intel_iommu_assign_d
         }
     }
 
-    ret = reassign_device_ownership(s, d, devfn, pdev);
-    if ( ret || d == dom_io )
-        return ret;
+    if ( d == dom_io )
+        return reassign_device_ownership(s, d, devfn, pdev);
 
     /* Setup rmrr identity mapping */
     for_each_rmrr_device( rmrr, bdf, i )
@@ -2545,20 +2544,37 @@ static int cf_check intel_iommu_assign_d
                                          rmrr->end_address, flag);
             if ( ret )
             {
-                int rc;
-
-                rc = reassign_device_ownership(d, s, devfn, pdev);
                 printk(XENLOG_G_ERR VTDPREFIX
-                       " cannot map reserved region (%"PRIx64",%"PRIx64"] for Dom%d (%d)\n",
-                       rmrr->base_address, rmrr->end_address,
-                       d->domain_id, ret);
-                if ( rc )
-                {
-                    printk(XENLOG_ERR VTDPREFIX
-                           " failed to reclaim %pp from %pd (%d)\n",
-                           &PCI_SBDF3(seg, bus, devfn), d, rc);
-                    domain_crash(d);
-                }
+                       "%pd: cannot map reserved region [%"PRIx64",%"PRIx64"]: %d\n",
+                       d, rmrr->base_address, rmrr->end_address, ret);
+                break;
+            }
+        }
+    }
+
+    if ( !ret )
+        ret = reassign_device_ownership(s, d, devfn, pdev);
+
+    /* See reassign_device_ownership() for the hwdom aspect. */
+    if ( !ret || is_hardware_domain(d) )
+        return ret;
+
+    for_each_rmrr_device( rmrr, bdf, i )
+    {
+        if ( rmrr->segment == seg &&
+             PCI_BUS(bdf) == bus &&
+             PCI_DEVFN2(bdf) == devfn )
+        {
+            int rc = iommu_identity_mapping(d, p2m_access_x,
+                                            rmrr->base_address,
+                                            rmrr->end_address, 0);
+
+            if ( rc && rc != -ENOENT )
+            {
+                printk(XENLOG_ERR VTDPREFIX
+                       "%pd: cannot unmap reserved region [%"PRIx64",%"PRIx64"]: %d\n",
+                       d, rmrr->base_address, rmrr->end_address, rc);
+                domain_crash(d);
                 break;
             }
         }
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: fix add/remove ordering when RMRRs are in use

In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully cleared.

Also switch to %pd in related log messages.

Fixes: fa88cfadf918 ("vt-d: Map RMRR in intel_iommu_add_device() if the device has RMRR")
Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
v7: Re-base.
v2: New.

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1992,14 +1992,6 @@ static int cf_check intel_iommu_add_devi
     if ( !pdev->domain )
         return -EINVAL;
 
-    ret = domain_context_mapping(pdev->domain, devfn, pdev);
-    if ( ret )
-    {
-        dprintk(XENLOG_ERR VTDPREFIX, "d%d: context mapping failed\n",
-                pdev->domain->domain_id);
-        return ret;
-    }
-
     for_each_rmrr_device ( rmrr, bdf, i )
     {
         if ( rmrr->segment == pdev->seg &&
@@ -2016,12 +2008,17 @@ static int cf_check intel_iommu_add_devi
                                          rmrr->base_address, rmrr->end_address,
                                          0);
             if ( ret )
-                dprintk(XENLOG_ERR VTDPREFIX, "d%d: RMRR mapping failed\n",
-                        pdev->domain->domain_id);
+                dprintk(XENLOG_ERR VTDPREFIX, "%pd: RMRR mapping failed\n",
+                        pdev->domain);
         }
     }
 
-    return 0;
+    ret = domain_context_mapping(pdev->domain, devfn, pdev);
+    if ( ret )
+        dprintk(XENLOG_ERR VTDPREFIX, "%pd: context mapping failed\n",
+                pdev->domain);
+
+    return ret;
 }
 
 static int cf_check intel_iommu_enable_device(struct pci_dev *pdev)
@@ -2043,11 +2040,15 @@ static int cf_check intel_iommu_remove_d
 {
     struct acpi_rmrr_unit *rmrr;
     u16 bdf;
-    int i;
+    int ret, i;
 
     if ( !pdev->domain )
         return -EINVAL;
 
+    ret = domain_context_unmap(pdev->domain, devfn, pdev);
+    if ( ret )
+        return ret;
+
     for_each_rmrr_device ( rmrr, bdf, i )
     {
         if ( rmrr->segment != pdev->seg ||
@@ -2063,7 +2064,7 @@ static int cf_check intel_iommu_remove_d
                                rmrr->end_address, 0);
     }
 
-    return domain_context_unmap(pdev->domain, devfn, pdev);
+    return 0;
 }
 
 static int __hwdom_init cf_check setup_hwdom_device(
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: tighten iommu_alloc_pgtable()'s parameter

This is to make more obvious that nothing outside of domain_iommu(d)
actually changes or is otherwise needed by the function.

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
v3: Re-base.
v2: New.

--- a/xen/arch/x86/include/asm/iommu.h
+++ b/xen/arch/x86/include/asm/iommu.h
@@ -142,7 +142,8 @@ static inline void iommu_sync_cache(cons
 }
 
 int __must_check iommu_free_pgtables(struct domain *d);
-struct page_info *__must_check iommu_alloc_pgtable(struct domain *d);
+struct domain_iommu;
+struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd);
 
 #endif /* !__ARCH_X86_IOMMU_H__ */
 /*
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -184,7 +184,7 @@ static int iommu_pde_from_dfn(struct dom
     unsigned long  next_table_mfn;
     unsigned int level;
     struct page_info *table;
-    const struct domain_iommu *hd = dom_iommu(d);
+    struct domain_iommu *hd = dom_iommu(d);
 
     table = hd->arch.amd.root_table;
     level = hd->arch.amd.paging_mode;
@@ -219,7 +219,7 @@ static int iommu_pde_from_dfn(struct dom
             mfn = next_table_mfn;
 
             /* allocate lower level page table */
-            table = iommu_alloc_pgtable(d);
+            table = iommu_alloc_pgtable(hd);
             if ( table == NULL )
             {
                 AMD_IOMMU_ERROR("cannot allocate I/O page table\n");
@@ -249,7 +249,7 @@ static int iommu_pde_from_dfn(struct dom
 
             if ( next_table_mfn == 0 )
             {
-                table = iommu_alloc_pgtable(d);
+                table = iommu_alloc_pgtable(hd);
                 if ( table == NULL )
                 {
                     AMD_IOMMU_ERROR("cannot allocate I/O page table\n");
@@ -555,7 +555,7 @@ int __init cf_check amd_iommu_quarantine
 
     spin_lock(&hd->arch.mapping_lock);
 
-    hd->arch.amd.root_table = iommu_alloc_pgtable(d);
+    hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
     if ( !hd->arch.amd.root_table )
         goto out;
 
@@ -570,7 +570,7 @@ int __init cf_check amd_iommu_quarantine
          * page table pages, and the resulting allocations are always
          * zeroed.
          */
-        pg = iommu_alloc_pgtable(d);
+        pg = iommu_alloc_pgtable(hd);
         if ( !pg )
             break;
 
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -242,7 +242,7 @@ int amd_iommu_alloc_root(struct domain *
 
     if ( unlikely(!hd->arch.amd.root_table) )
     {
-        hd->arch.amd.root_table = iommu_alloc_pgtable(d);
+        hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
         if ( !hd->arch.amd.root_table )
             return -ENOMEM;
     }
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -318,7 +318,7 @@ static u64 addr_to_dma_page_maddr(struct
     {
         struct page_info *pg;
 
-        if ( !alloc || !(pg = iommu_alloc_pgtable(domain)) )
+        if ( !alloc || !(pg = iommu_alloc_pgtable(hd)) )
             goto out;
 
         hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
@@ -338,7 +338,7 @@ static u64 addr_to_dma_page_maddr(struct
             if ( !alloc )
                 break;
 
-            pg = iommu_alloc_pgtable(domain);
+            pg = iommu_alloc_pgtable(hd);
             if ( !pg )
                 break;
 
@@ -2763,7 +2763,7 @@ static int __init cf_check intel_iommu_q
         goto out;
     }
 
-    pg = iommu_alloc_pgtable(d);
+    pg = iommu_alloc_pgtable(hd);
 
     rc = -ENOMEM;
     if ( !pg )
@@ -2782,7 +2782,7 @@ static int __init cf_check intel_iommu_q
          * page table pages, and the resulting allocations are always
          * zeroed.
          */
-        pg = iommu_alloc_pgtable(d);
+        pg = iommu_alloc_pgtable(hd);
 
         if ( !pg )
             goto out;
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -420,9 +420,8 @@ int iommu_free_pgtables(struct domain *d
     return 0;
 }
 
-struct page_info *iommu_alloc_pgtable(struct domain *d)
+struct page_info *iommu_alloc_pgtable(struct domain_iommu *hd)
 {
-    struct domain_iommu *hd = dom_iommu(d);
     unsigned int memflags = 0;
     struct page_info *pg;
     void *p;
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: split domid map cleanup check into a function

This logic will want invoking from elsewhere.

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -152,6 +152,68 @@ static void __init free_intel_iommu(stru
     xfree(intel);
 }
 
+static void cleanup_domid_map(struct domain *domain, struct iommu *iommu)
+{
+    int iommu_domid = domain_iommu_domid(domain, iommu);
+
+    if ( iommu_domid >= 0 )
+    {
+        /*
+         * Update domid_map[] /before/ domid_bitmap[] to avoid a race with
+         * context_set_domain_id(), setting the slot to DOMID_INVALID for
+         * ->domid_map[] reads to produce a suitable value while the bit is
+         * still set.
+         */
+        iommu->domid_map[iommu_domid] = DOMID_INVALID;
+        clear_bit(iommu_domid, iommu->domid_bitmap);
+    }
+}
+
+static bool any_pdev_behind_iommu(const struct domain *d,
+                                  const struct pci_dev *exclude,
+                                  const struct iommu *iommu)
+{
+    const struct pci_dev *pdev;
+
+    for_each_pdev ( d, pdev )
+    {
+        const struct acpi_drhd_unit *drhd;
+
+        if ( pdev == exclude )
+            continue;
+
+        drhd = acpi_find_matched_drhd_unit(pdev);
+        if ( drhd && drhd->iommu == iommu )
+            return true;
+    }
+
+    return false;
+}
+
+/*
+ * If no other devices under the same iommu owned by this domain,
+ * clear iommu in iommu_bitmap and clear domain_id in domid_bitmap.
+ */
+static void check_cleanup_domid_map(struct domain *d,
+                                    const struct pci_dev *exclude,
+                                    struct iommu *iommu)
+{
+    bool found = any_pdev_behind_iommu(d, exclude, iommu);
+
+    /*
+     * Hidden devices are associated with DomXEN but usable by the hardware
+     * domain. Hence they need considering here as well.
+     */
+    if ( !found && is_hardware_domain(d) )
+        found = any_pdev_behind_iommu(dom_xen, exclude, iommu);
+
+    if ( !found )
+    {
+        clear_bit(iommu->index, &dom_iommu(d)->arch.iommu_bitmap);
+        cleanup_domid_map(d, iommu);
+    }
+}
+
 static int iommus_incoherent;
 
 static void sync_cache(const void *addr, unsigned int size)
@@ -1671,7 +1733,6 @@ static int domain_context_unmap(struct d
     struct iommu *iommu;
     int ret = 0;
     u8 seg = pdev->seg, bus = pdev->bus, tmp_bus, tmp_devfn, secbus;
-    int found = 0;
 
     drhd = acpi_find_matched_drhd_unit(pdev);
     if ( !drhd )
@@ -1740,45 +1801,8 @@ static int domain_context_unmap(struct d
         goto out;
     }
 
-    /*
-     * if no other devices under the same iommu owned by this domain,
-     * clear iommu in iommu_bitmap and clear domain_id in domid_bitmp
-     */
-    for_each_pdev ( domain, pdev )
-    {
-        if ( pdev->seg == seg && pdev->bus == bus && pdev->devfn == devfn )
-            continue;
-
-        drhd = acpi_find_matched_drhd_unit(pdev);
-        if ( drhd && drhd->iommu == iommu )
-        {
-            found = 1;
-            break;
-        }
-    }
-
-    if ( found == 0 )
-    {
-        int iommu_domid;
-
-        clear_bit(iommu->index, &dom_iommu(domain)->arch.iommu_bitmap);
-
-        iommu_domid = domain_iommu_domid(domain, iommu);
-        if ( iommu_domid == -1 )
-        {
-            ret = -EINVAL;
-            goto out;
-        }
-
-        /*
-         * Update domid_map[] /before/ domid_bitmap[] to avoid a race with
-         * context_set_domain_id(), setting the slot to DOMID_INVALID for
-         * ->domid_map[] reads to produce a suitable value while the bit is
-         * still set.
-         */
-        iommu->domid_map[iommu_domid] = DOMID_INVALID;
-        clear_bit(iommu_domid, iommu->domid_bitmap);
-    }
+    if ( !ret )
+        check_cleanup_domid_map(domain, pdev, iommu);
 
 out:
     return ret;
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: fix (de)assign ordering when RMRRs are in use

In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully updated.

Also adjust a related log message.

This is CVE-2022-26358 / part of XSA-400.

Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2352,6 +2352,10 @@ static int reassign_device_ownership(
 {
     int ret;
 
+    ret = domain_context_unmap(source, devfn, pdev);
+    if ( ret )
+        return ret;
+
     /*
      * Devices assigned to untrusted domains (here assumed to be any domU)
      * can attempt to send arbitrary LAPIC/MSI messages. We are unprotected
@@ -2388,10 +2392,6 @@ static int reassign_device_ownership(
             }
     }
 
-    ret = domain_context_unmap(source, devfn, pdev);
-    if ( ret )
-        return ret;
-
     if ( devfn == pdev->devfn && pdev->domain != dom_io )
     {
         list_move(&pdev->domain_list, &dom_io->arch.pdev_list);
@@ -2468,9 +2468,8 @@ static int intel_iommu_assign_device(
         }
     }
 
-    ret = reassign_device_ownership(s, d, devfn, pdev);
-    if ( ret || d == dom_io )
-        return ret;
+    if ( d == dom_io )
+        return reassign_device_ownership(s, d, devfn, pdev);
 
     /* Setup rmrr identity mapping */
     for_each_rmrr_device( rmrr, bdf, i )
@@ -2483,20 +2482,37 @@ static int intel_iommu_assign_device(
                                          rmrr->end_address, flag);
             if ( ret )
             {
-                int rc;
-
-                rc = reassign_device_ownership(d, s, devfn, pdev);
                 printk(XENLOG_G_ERR VTDPREFIX
-                       " cannot map reserved region (%"PRIx64",%"PRIx64"] for Dom%d (%d)\n",
-                       rmrr->base_address, rmrr->end_address,
-                       d->domain_id, ret);
-                if ( rc )
-                {
-                    printk(XENLOG_ERR VTDPREFIX
-                           " failed to reclaim %04x:%02x:%02x.%u from %pd (%d)\n",
-                           seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), d, rc);
-                    domain_crash(d);
-                }
+                       "%pd: cannot map reserved region [%"PRIx64",%"PRIx64"]: %d\n",
+                       d, rmrr->base_address, rmrr->end_address, ret);
+                break;
+            }
+        }
+    }
+
+    if ( !ret )
+        ret = reassign_device_ownership(s, d, devfn, pdev);
+
+    /* See reassign_device_ownership() for the hwdom aspect. */
+    if ( !ret || is_hardware_domain(d) )
+        return ret;
+
+    for_each_rmrr_device( rmrr, bdf, i )
+    {
+        if ( rmrr->segment == seg &&
+             PCI_BUS(bdf) == bus &&
+             PCI_DEVFN2(bdf) == devfn )
+        {
+            int rc = iommu_identity_mapping(d, p2m_access_x,
+                                            rmrr->base_address,
+                                            rmrr->end_address, 0);
+
+            if ( rc && rc != -ENOENT )
+            {
+                printk(XENLOG_ERR VTDPREFIX
+                       "%pd: cannot unmap reserved region [%"PRIx64",%"PRIx64"]: %d\n",
+                       d, rmrr->base_address, rmrr->end_address, rc);
+                domain_crash(d);
                 break;
             }
         }
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: fix add/remove ordering when RMRRs are in use

In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully cleared.

Also switch to %pd in related log messages.

Fixes: fa88cfadf918 ("vt-d: Map RMRR in intel_iommu_add_device() if the device has RMRR")
Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1985,14 +1985,6 @@ static int intel_iommu_add_device(u8 dev
     if ( !pdev->domain )
         return -EINVAL;
 
-    ret = domain_context_mapping(pdev->domain, devfn, pdev);
-    if ( ret )
-    {
-        dprintk(XENLOG_ERR VTDPREFIX, "d%d: context mapping failed\n",
-                pdev->domain->domain_id);
-        return ret;
-    }
-
     for_each_rmrr_device ( rmrr, bdf, i )
     {
         if ( rmrr->segment == pdev->seg &&
@@ -2009,12 +2001,17 @@ static int intel_iommu_add_device(u8 dev
                                          rmrr->base_address, rmrr->end_address,
                                          0);
             if ( ret )
-                dprintk(XENLOG_ERR VTDPREFIX, "d%d: RMRR mapping failed\n",
-                        pdev->domain->domain_id);
+                dprintk(XENLOG_ERR VTDPREFIX, "%pd: RMRR mapping failed\n",
+                        pdev->domain);
         }
     }
 
-    return 0;
+    ret = domain_context_mapping(pdev->domain, devfn, pdev);
+    if ( ret )
+        dprintk(XENLOG_ERR VTDPREFIX, "%pd: context mapping failed\n",
+                pdev->domain);
+
+    return ret;
 }
 
 static int intel_iommu_enable_device(struct pci_dev *pdev)
@@ -2036,11 +2033,15 @@ static int intel_iommu_remove_device(u8
 {
     struct acpi_rmrr_unit *rmrr;
     u16 bdf;
-    int i;
+    int ret, i;
 
     if ( !pdev->domain )
         return -EINVAL;
 
+    ret = domain_context_unmap(pdev->domain, devfn, pdev);
+    if ( ret )
+        return ret;
+
     for_each_rmrr_device ( rmrr, bdf, i )
     {
         if ( rmrr->segment != pdev->seg ||
@@ -2056,7 +2057,7 @@ static int intel_iommu_remove_device(u8
                                rmrr->end_address, 0);
     }
 
-    return domain_context_unmap(pdev->domain, devfn, pdev);
+    return 0;
 }
 
 static int __hwdom_init setup_hwdom_device(u8 devfn, struct pci_dev *pdev)
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: drop ownership checking from domain_context_mapping_one()

Despite putting in quite a bit of effort it was not possible to
establish why exactly this code exists (beyond possibly sanity
checking). Instead of a subsequent change further complicating this
logic, simply get rid of it.

Take the opportunity and move the respective unmap_vtd_domain_page() out
of the locked region.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -112,28 +112,6 @@ static int context_set_domain_id(struct
     return 0;
 }
 
-static int context_get_domain_id(struct context_entry *context,
-                                 struct iommu *iommu)
-{
-    unsigned long dom_index, nr_dom;
-    int domid = -1;
-
-    if (iommu && context)
-    {
-        nr_dom = cap_ndoms(iommu->cap);
-
-        dom_index = context_domain_id(*context);
-
-        if ( dom_index < nr_dom && iommu->domid_map )
-            domid = iommu->domid_map[dom_index];
-        else
-            dprintk(XENLOG_DEBUG VTDPREFIX,
-                    "dom_index %lu exceeds nr_dom %lu or iommu has no domid_map\n",
-                    dom_index, nr_dom);
-    }
-    return domid;
-}
-
 static struct intel_iommu *__init alloc_intel_iommu(void)
 {
     struct intel_iommu *intel;
@@ -1433,49 +1411,9 @@ int domain_context_mapping_one(
 
     if ( context_present(*context) )
     {
-        int res = 0;
-
-        /* Try to get domain ownership from device structure.  If that's
-         * not available, try to read it from the context itself. */
-        if ( pdev )
-        {
-            if ( pdev->domain != domain )
-            {
-                printk(XENLOG_G_INFO VTDPREFIX
-                       "d%d: %04x:%02x:%02x.%u owned by d%d!",
-                       domain->domain_id,
-                       seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
-                       pdev->domain ? pdev->domain->domain_id : -1);
-                res = -EINVAL;
-            }
-        }
-        else
-        {
-            int cdomain;
-            cdomain = context_get_domain_id(context, iommu);
-            
-            if ( cdomain < 0 )
-            {
-                printk(XENLOG_G_WARNING VTDPREFIX
-                       "d%d: %04x:%02x:%02x.%u mapped, but can't find owner!\n",
-                       domain->domain_id,
-                       seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
-                res = -EINVAL;
-            }
-            else if ( cdomain != domain->domain_id )
-            {
-                printk(XENLOG_G_INFO VTDPREFIX
-                       "d%d: %04x:%02x:%02x.%u already mapped to d%d!",
-                       domain->domain_id,
-                       seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
-                       cdomain);
-                res = -EINVAL;
-            }
-        }
-
-        unmap_vtd_domain_page(context_entries);
         spin_unlock(&iommu->lock);
-        return res;
+        unmap_vtd_domain_page(context_entries);
+        return 0;
     }
 
     if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: re-assign devices directly

Devices with RMRRs, due to it being unspecified how/when the specified
memory regions may get accessed, may not be left disconnected from their
respective mappings (as long as it's not certain that the device has
been fully quiesced). Hence rather than unmapping the old context and
then mapping the new one, re-assignment needs to be done in a single
step.

This is CVE-2022-26359 / part of XSA-400.

Reported-by: Roger Pau Monné <roger.pau@citrix.com>

Similarly quarantining scratch-page mode relies on page tables to be
continuously wired up.

To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any RMRRs. The
main difference is when it comes to updating context entries, which need
to be atomic when there are RMRRs. Yet atomicity can only be achieved
with CMPXCHG16B, availability of which we can't take for given.

The seemingly complicated choice of non-negative return values for
domain_context_mapping_one() is to limit code churn: This way callers
passing NULL for pdev don't need fiddling with.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -71,7 +71,8 @@ void free_pgtable_maddr(u64 maddr);
 void *map_vtd_domain_page(u64 maddr);
 void unmap_vtd_domain_page(void *va);
 int domain_context_mapping_one(struct domain *domain, struct iommu *iommu,
-                               u8 bus, u8 devfn, const struct pci_dev *);
+                               uint8_t bus, uint8_t devfn,
+                               const struct pci_dev *pdev, unsigned int mode);
 int domain_context_unmap_one(struct domain *domain, struct iommu *iommu,
                              u8 bus, u8 devfn);
 int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
@@ -91,8 +92,8 @@ int is_igd_vt_enabled_quirk(void);
 void platform_quirks_init(void);
 void vtd_ops_preamble_quirk(struct iommu* iommu);
 void vtd_ops_postamble_quirk(struct iommu* iommu);
-int __must_check me_wifi_quirk(struct domain *domain,
-                               u8 bus, u8 devfn, int map);
+int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus,
+                               uint8_t devfn, unsigned int mode);
 void pci_vtd_quirk(const struct pci_dev *);
 void quirk_iommu_caps(struct iommu *iommu);
 
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -108,6 +108,7 @@ static int context_set_domain_id(struct
     }
 
     set_bit(i, iommu->domid_bitmap);
+    context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
     context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
     return 0;
 }
@@ -1391,15 +1392,27 @@ static void __hwdom_init intel_iommu_hwd
     }
 }
 
+/*
+ * This function returns
+ * - a negative errno value upon error,
+ * - zero upon success when previously the entry was non-present, or this isn't
+ *   the "main" request for a device (pdev == NULL), or for no-op quarantining
+ *   assignments,
+ * - positive (one) upon success when previously the entry was present and this
+ *   is the "main" request for a device (pdev != NULL).
+ */
 int domain_context_mapping_one(
     struct domain *domain,
     struct iommu *iommu,
-    u8 bus, u8 devfn, const struct pci_dev *pdev)
+    uint8_t bus, uint8_t devfn, const struct pci_dev *pdev,
+    unsigned int mode)
 {
     struct domain_iommu *hd = dom_iommu(domain);
-    struct context_entry *context, *context_entries;
+    struct context_entry *context, *context_entries, lctxt;
+    __uint128_t old;
     u64 maddr, pgd_maddr;
-    u16 seg = iommu->intel->drhd->segment;
+    uint16_t seg = iommu->intel->drhd->segment, prev_did = 0;
+    struct domain *prev_dom = NULL;
     int agaw, rc, ret;
     bool_t flush_dev_iotlb;
 
@@ -1408,17 +1421,32 @@ int domain_context_mapping_one(
     maddr = bus_to_context_maddr(iommu, bus);
     context_entries = (struct context_entry *)map_vtd_domain_page(maddr);
     context = &context_entries[devfn];
+    old = (lctxt = *context).full;
 
-    if ( context_present(*context) )
+    if ( context_present(lctxt) )
     {
-        spin_unlock(&iommu->lock);
-        unmap_vtd_domain_page(context_entries);
-        return 0;
+        domid_t domid;
+
+        prev_did = context_domain_id(lctxt);
+        domid = iommu->domid_map[prev_did];
+        if ( domid < DOMID_FIRST_RESERVED )
+            prev_dom = rcu_lock_domain_by_id(domid);
+        else if ( domid == DOMID_IO )
+            prev_dom = rcu_lock_domain(dom_io);
+        if ( !prev_dom )
+        {
+            spin_unlock(&iommu->lock);
+            unmap_vtd_domain_page(context_entries);
+            dprintk(XENLOG_DEBUG VTDPREFIX,
+                    "no domain for did %u (nr_dom %u)\n",
+                    prev_did, cap_ndoms(iommu->cap));
+            return -ESRCH;
+        }
     }
 
     if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
     {
-        context_set_translation_type(*context, CONTEXT_TT_PASS_THRU);
+        context_set_translation_type(lctxt, CONTEXT_TT_PASS_THRU);
         agaw = level_to_agaw(iommu->nr_pt_levels);
     }
     else
@@ -1435,6 +1463,8 @@ int domain_context_mapping_one(
                 spin_unlock(&hd->arch.mapping_lock);
                 spin_unlock(&iommu->lock);
                 unmap_vtd_domain_page(context_entries);
+                if ( prev_dom )
+                    rcu_unlock_domain(prev_dom);
                 return -ENOMEM;
             }
         }
@@ -1452,33 +1482,102 @@ int domain_context_mapping_one(
                 goto nomem;
         }
 
-        context_set_address_root(*context, pgd_maddr);
+        context_set_address_root(lctxt, pgd_maddr);
         if ( ats_enabled && ecap_dev_iotlb(iommu->ecap) )
-            context_set_translation_type(*context, CONTEXT_TT_DEV_IOTLB);
+            context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB);
         else
-            context_set_translation_type(*context, CONTEXT_TT_MULTI_LEVEL);
+            context_set_translation_type(lctxt, CONTEXT_TT_MULTI_LEVEL);
 
         spin_unlock(&hd->arch.mapping_lock);
     }
 
-    if ( context_set_domain_id(context, domain, iommu) )
+    rc = context_set_domain_id(&lctxt, domain, iommu);
+    if ( rc )
     {
+    unlock:
         spin_unlock(&iommu->lock);
         unmap_vtd_domain_page(context_entries);
-        return -EFAULT;
+        if ( prev_dom )
+            rcu_unlock_domain(prev_dom);
+        return rc;
+    }
+
+    if ( !prev_dom )
+    {
+        context_set_address_width(lctxt, agaw);
+        context_set_fault_enable(lctxt);
+        context_set_present(lctxt);
+    }
+    else if ( prev_dom == domain )
+    {
+        ASSERT(lctxt.full == context->full);
+        rc = !!pdev;
+        goto unlock;
+    }
+    else
+    {
+        ASSERT(context_address_width(lctxt) == agaw);
+        ASSERT(!context_fault_disable(lctxt));
+    }
+
+    if ( cpu_has_cx16 )
+    {
+        __uint128_t res = cmpxchg16b(context, &old, &lctxt.full);
+
+        /*
+         * Hardware does not update the context entry behind our backs,
+         * so the return value should match "old".
+         */
+        if ( res != old )
+        {
+            if ( pdev )
+                check_cleanup_domid_map(domain, pdev, iommu);
+            printk(XENLOG_ERR
+                   "%04x:%02x:%02x.%u: unexpected context entry %016lx_%016lx (expected %016lx_%016lx)\n",
+                   pdev->seg, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                   (uint64_t)(res >> 64), (uint64_t)res,
+                   (uint64_t)(old >> 64), (uint64_t)old);
+            rc = -EILSEQ;
+            goto unlock;
+        }
+    }
+    else if ( !prev_dom || !(mode & MAP_WITH_RMRR) )
+    {
+        context_clear_present(*context);
+        iommu_sync_cache(context, sizeof(*context));
+
+        write_atomic(&context->hi, lctxt.hi);
+        /* No barrier should be needed between these two. */
+        write_atomic(&context->lo, lctxt.lo);
+    }
+    else /* Best effort, updating DID last. */
+    {
+         /*
+          * By non-atomically updating the context entry's DID field last,
+          * during a short window in time TLB entries with the old domain ID
+          * but the new page tables may be inserted.  This could affect I/O
+          * of other devices using this same (old) domain ID.  Such updating
+          * therefore is not a problem if this was the only device associated
+          * with the old domain ID.  Diverting I/O of any of a dying domain's
+          * devices to the quarantine page tables is intended anyway.
+          */
+        if ( !(mode & (MAP_OWNER_DYING | MAP_SINGLE_DEVICE)) )
+            printk(XENLOG_WARNING VTDPREFIX
+                   " %04x:%02x:%02x.%u: reassignment may cause %pd data corruption\n",
+                   seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), prev_dom);
+
+        write_atomic(&context->lo, lctxt.lo);
+        /* No barrier should be needed between these two. */
+        write_atomic(&context->hi, lctxt.hi);
     }
 
-    context_set_address_width(*context, agaw);
-    context_set_fault_enable(*context);
-    context_set_present(*context);
     iommu_sync_cache(context, sizeof(struct context_entry));
     spin_unlock(&iommu->lock);
 
-    /* Context entry was previously non-present (with domid 0). */
-    rc = iommu_flush_context_device(iommu, 0, PCI_BDF2(bus, devfn),
-                                    DMA_CCMD_MASK_NOBIT, 1);
+    rc = iommu_flush_context_device(iommu, prev_did, PCI_BDF2(bus, devfn),
+                                    DMA_CCMD_MASK_NOBIT, !prev_dom);
     flush_dev_iotlb = !!find_ats_dev_drhd(iommu);
-    ret = iommu_flush_iotlb_dsi(iommu, 0, 1, flush_dev_iotlb);
+    ret = iommu_flush_iotlb_dsi(iommu, prev_did, !prev_dom, flush_dev_iotlb);
 
     /*
      * The current logic for returns:
@@ -1499,17 +1598,35 @@ int domain_context_mapping_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !seg && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, MAP_ME_PHANTOM_FUNC);
+        rc = me_wifi_quirk(domain, bus, devfn, mode);
 
-    return rc;
+    if ( rc )
+    {
+        if ( !prev_dom )
+            domain_context_unmap_one(domain, iommu, bus, devfn);
+        else if ( prev_dom != domain ) /* Avoid infinite recursion. */
+            domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
+                                       mode & MAP_WITH_RMRR);
+    }
+
+    if ( prev_dom )
+        rcu_unlock_domain(prev_dom);
+
+    return rc ?: pdev && prev_dom;
 }
 
+static int domain_context_unmap(struct domain *d, uint8_t devfn,
+                                struct pci_dev *pdev);
+
 static int domain_context_mapping(struct domain *domain, u8 devfn,
                                   struct pci_dev *pdev)
 {
     struct acpi_drhd_unit *drhd;
+    const struct acpi_rmrr_unit *rmrr;
     int ret = 0;
-    u8 seg = pdev->seg, bus = pdev->bus, secbus;
+    unsigned int i, mode = 0;
+    uint16_t seg = pdev->seg, bdf;
+    uint8_t bus = pdev->bus, secbus;
 
     drhd = acpi_find_matched_drhd_unit(pdev);
     if ( !drhd )
@@ -1517,8 +1634,30 @@ static int domain_context_mapping(struct
 
     ASSERT(pcidevs_locked());
 
+    for_each_rmrr_device( rmrr, bdf, i )
+    {
+        if ( rmrr->segment != pdev->seg ||
+             bdf != PCI_BDF2(pdev->bus, pdev->devfn) )
+            continue;
+
+        mode |= MAP_WITH_RMRR;
+        break;
+    }
+
+    if ( domain != pdev->domain )
+    {
+        if ( pdev->domain->is_dying )
+            mode |= MAP_OWNER_DYING;
+        else if ( drhd &&
+                  !any_pdev_behind_iommu(pdev->domain, pdev, drhd->iommu) &&
+                  !pdev->phantom_stride )
+            mode |= MAP_SINGLE_DEVICE;
+    }
+
     switch ( pdev->type )
     {
+        bool prev_present;
+
     case DEV_TYPE_PCI_HOST_BRIDGE:
         if ( iommu_debug )
             printk(VTDPREFIX "d%d:Hostbridge: skip %04x:%02x:%02x.%u map\n",
@@ -1539,7 +1678,9 @@ static int domain_context_mapping(struct
                    domain->domain_id, seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev);
+                                         pdev, mode);
+        if ( ret > 0 )
+            ret = 0;
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             enable_ats_device(pdev, &drhd->iommu->ats_devices);
 
@@ -1552,20 +1693,33 @@ static int domain_context_mapping(struct
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev);
-        if ( ret )
+                                         pdev, mode);
+        if ( ret < 0 )
             break;
+        prev_present = ret;
+        ret = 0;
 
         if ( find_upstream_bridge(seg, &bus, &devfn, &secbus) < 1 )
             break;
 
         /*
+         * Strictly speaking if the device is the only one behind this bridge
+         * and the only one with this (secbus,0,0) tuple, it could be allowed
+         * to be re-assigned regardless of RMRR presence.  But let's deal with
+         * that case only if it is actually found in the wild.
+         */
+        if ( prev_present && (mode & MAP_WITH_RMRR) &&
+             domain != pdev->domain )
+            ret = -EOPNOTSUPP;
+
+        /*
          * Mapping a bridge should, if anything, pass the struct pci_dev of
          * that bridge. Since bridges don't normally get assigned to guests,
          * their owner would be the wrong one. Pass NULL instead.
          */
-        ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         NULL);
+        if ( ret >= 0 )
+            ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
+                                             NULL, mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1580,7 +1734,15 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL);
+                                             NULL, mode);
+
+        if ( ret )
+        {
+            if ( !prev_present )
+                domain_context_unmap(domain, devfn, pdev);
+            else if ( pdev->domain != domain ) /* Avoid infinite recursion. */
+                domain_context_mapping(pdev->domain, devfn, pdev);
+        }
 
         break;
 
@@ -2291,9 +2453,8 @@ static int reassign_device_ownership(
 {
     int ret;
 
-    ret = domain_context_unmap(source, devfn, pdev);
-    if ( ret )
-        return ret;
+    if ( !has_arch_pdevs(target) )
+        vmx_pi_hooks_assign(target);
 
     /*
      * Devices assigned to untrusted domains (here assumed to be any domU)
@@ -2303,6 +2464,31 @@ static int reassign_device_ownership(
     if ( (target != hardware_domain) && !iommu_intremap )
         untrusted_msi = true;
 
+    ret = domain_context_mapping(target, devfn, pdev);
+    if ( ret )
+    {
+        if ( !has_arch_pdevs(target) )
+            vmx_pi_hooks_deassign(target);
+        return ret;
+    }
+
+    if ( pdev->devfn == devfn )
+    {
+        const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
+
+        if ( drhd )
+            check_cleanup_domid_map(source, pdev, drhd->iommu);
+    }
+
+    if ( devfn == pdev->devfn && pdev->domain != target )
+    {
+        list_move(&pdev->domain_list, &target->arch.pdev_list);
+        pdev->domain = target;
+    }
+
+    if ( !has_arch_pdevs(source) )
+        vmx_pi_hooks_deassign(source);
+
     /*
      * If the device belongs to the hardware domain, and it has RMRR, don't
      * remove it from the hardware domain, because BIOS may use RMRR at
@@ -2331,34 +2517,7 @@ static int reassign_device_ownership(
             }
     }
 
-    if ( devfn == pdev->devfn && pdev->domain != dom_io )
-    {
-        list_move(&pdev->domain_list, &dom_io->arch.pdev_list);
-        pdev->domain = dom_io;
-    }
-
-    if ( !has_arch_pdevs(source) )
-        vmx_pi_hooks_deassign(source);
-
-    if ( !has_arch_pdevs(target) )
-        vmx_pi_hooks_assign(target);
-
-    ret = domain_context_mapping(target, devfn, pdev);
-    if ( ret )
-    {
-        if ( !has_arch_pdevs(target) )
-            vmx_pi_hooks_deassign(target);
-
-        return ret;
-    }
-
-    if ( devfn == pdev->devfn && pdev->domain != target )
-    {
-        list_move(&pdev->domain_list, &target->arch.pdev_list);
-        pdev->domain = target;
-    }
-
-    return ret;
+    return 0;
 }
 
 static int intel_iommu_assign_device(
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -201,8 +201,12 @@ struct root_entry {
     do {(root).val |= ((value) & PAGE_MASK_4K);} while(0)
 
 struct context_entry {
-    u64 lo;
-    u64 hi;
+    union {
+        struct {
+            uint64_t lo, hi;
+        };
+        __uint128_t full;
+    };
 };
 #define ROOT_ENTRY_NR (PAGE_SIZE_4K/sizeof(struct root_entry))
 #define context_present(c) ((c).lo & 1)
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -330,7 +330,8 @@ void __init platform_quirks_init(void)
  */
 
 static int __must_check map_me_phantom_function(struct domain *domain,
-                                                u32 dev, int map)
+                                                unsigned int dev,
+                                                unsigned int mode)
 {
     struct acpi_drhd_unit *drhd;
     struct pci_dev *pdev;
@@ -341,9 +342,9 @@ static int __must_check map_me_phantom_f
     drhd = acpi_find_matched_drhd_unit(pdev);
 
     /* map or unmap ME phantom function */
-    if ( map )
+    if ( !(mode & UNMAP_ME_PHANTOM_FUNC) )
         rc = domain_context_mapping_one(domain, drhd->iommu, 0,
-                                        PCI_DEVFN(dev, 7), NULL);
+                                        PCI_DEVFN(dev, 7), NULL, mode);
     else
         rc = domain_context_unmap_one(domain, drhd->iommu, 0,
                                       PCI_DEVFN(dev, 7));
@@ -351,7 +352,8 @@ static int __must_check map_me_phantom_f
     return rc;
 }
 
-int me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map)
+int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn,
+                  unsigned int mode)
 {
     u32 id;
     int rc = 0;
@@ -375,7 +377,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x423b8086:
             case 0x423c8086:
             case 0x423d8086:
-                rc = map_me_phantom_function(domain, 3, map);
+                rc = map_me_phantom_function(domain, 3, mode);
                 break;
             default:
                 break;
@@ -401,7 +403,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x42388086:        /* Puma Peak */
             case 0x422b8086:
             case 0x422c8086:
-                rc = map_me_phantom_function(domain, 22, map);
+                rc = map_me_phantom_function(domain, 22, mode);
                 break;
             default:
                 break;
--- a/xen/drivers/passthrough/vtd/vtd.h
+++ b/xen/drivers/passthrough/vtd/vtd.h
@@ -22,8 +22,14 @@
 
 #include <xen/iommu.h>
 
-#define MAP_ME_PHANTOM_FUNC      1
-#define UNMAP_ME_PHANTOM_FUNC    0
+/*
+ * Values for domain_context_mapping_one()'s and me_wifi_quirk()'s "mode"
+ * parameters.
+ */
+#define MAP_WITH_RMRR         (1u << 0)
+#define MAP_OWNER_DYING       (1u << 1)
+#define MAP_SINGLE_DEVICE     (1u << 2)
+#define UNMAP_ME_PHANTOM_FUNC (1u << 3)
 
 /* Allow for both IOAPIC and IOSAPIC. */
 #define IO_xAPIC_route_entry IO_APIC_route_entry
From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: re-assign devices directly

Since it is unspecified how/when the memory ranges covered by unity maps
may get accessed, devices with such ranges may not be left disconnected
from their unity mappings (as long as it's not certain that the device
has been fully quiesced).  Hence, rather than tearing down the old root
page table pointer and then establishing the new one, re-assignment
needs to be done in a single step.

This is CVE-2022-26360 / part of XSA-400.

Reported-by: Roger Pau Monné <roger.pau@citrix.com>

Similarly, scratch-page quarantining mode relies on page tables being
continuously wired up.

To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any unity map
ranges.  The main difference is in updating DTEs, which needs to be
done atomically when there are unity mappings.  Yet atomicity can only
be achieved with CMPXCHG16B, whose availability we can't take for
granted.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
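
For illustration only (not part of the patch), a minimal standalone sketch
of the update pattern this implements, assuming x86-64 and GCC built with
-mcx16; update_dte() and has_cx16() are made-up names for the example.
With CMPXCHG16B the whole 128-bit DTE is swapped in one shot; without it,
the two 64-bit halves are stored individually, with the half holding the
domain ID written last.

    #include <stdbool.h>
    #include <stdint.h>
    #include <cpuid.h>

    typedef unsigned __int128 dte128_t;

    /* CPUID.1:ECX bit 13 indicates CMPXCHG16B support. */
    static bool has_cx16(void)
    {
        unsigned int eax, ebx, ecx, edx;

        return __get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 13));
    }

    /*
     * Returns 0 for an atomic update, 1 for a best-effort split update
     * (the same convention the new amd_iommu_set_root_page_table() uses).
     */
    static int update_dte(dte128_t *dte, dte128_t old, dte128_t new)
    {
        if ( has_cx16() )
        {
            /* One 16-byte compare-and-swap, akin to Xen's cmpxchg16b(). */
            (void)__sync_val_compare_and_swap(dte, old, new);
            return 0;
        }

        /* Fallback: two 64-bit stores; the domain-ID half is written last. */
        __atomic_store_n((uint64_t *)dte + 0, (uint64_t)new, __ATOMIC_RELEASE);
        __atomic_store_n((uint64_t *)dte + 1, (uint64_t)(new >> 64),
                         __ATOMIC_RELEASE);
        return 1;
    }

    int main(void)
    {
        dte128_t dte = 0;

        update_dte(&dte, 0, ((dte128_t)0x1234 << 64) | 0x5678);
        return 0;
    }
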

--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
@@ -78,8 +78,12 @@ void amd_iommu_share_p2m(struct domain *
 int get_dma_requestor_id(u16 seg, u16 bdf);
 void amd_iommu_set_intremap_table(
     u32 *dte, u64 intremap_ptr, u8 int_valid);
-void amd_iommu_set_root_page_table(
-    u32 *dte, u64 root_ptr, u16 domain_id, u8 paging_mode, u8 valid);
+#define SET_ROOT_VALID          (1u << 0)
+#define SET_ROOT_WITH_UNITY_MAP (1u << 1)
+int __must_check amd_iommu_set_root_page_table(
+    uint32_t *dte, uint64_t root_ptr, uint16_t domain_id, uint8_t paging_mode,
+    unsigned int flags);
+paddr_t amd_iommu_get_root_page_table(const uint32_t *dte);
 void iommu_dte_set_iotlb(u32 *dte, u8 i);
 void iommu_dte_add_device_entry(u32 *dte, struct ivrs_mappings *ivrs_dev);
 void iommu_dte_set_guest_cr3(u32 *dte, u16 dom_id, u64 gcr3,
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -158,12 +158,105 @@ static unsigned int set_iommu_pte_presen
     return flush_flags;
 }
 
-void amd_iommu_set_root_page_table(uint32_t *dte, uint64_t root_ptr,
-                                   uint16_t domain_id, uint8_t paging_mode,
-                                   uint8_t valid)
+/*
+ * This function returns
+ * - -errno for errors,
+ * - 0 for a successful update, atomic when necessary
+ * - 1 for a successful but non-atomic update, which may need to be warned
+ *   about by the caller.
+ */
+int amd_iommu_set_root_page_table(uint32_t *dte, uint64_t root_ptr,
+                                  uint16_t domain_id, uint8_t paging_mode,
+                                  unsigned int flags)
 {
+    bool valid = flags & SET_ROOT_VALID;
     uint32_t addr_hi, addr_lo, entry, dte0 = dte[0];
 
+    addr_lo = root_ptr & DMA_32BIT_MASK;
+    addr_hi = root_ptr >> 32;
+
+    if ( get_field_from_reg_u32(dte0, IOMMU_DEV_TABLE_VALID_MASK,
+                                IOMMU_DEV_TABLE_VALID_SHIFT) &&
+         get_field_from_reg_u32(dte0, IOMMU_DEV_TABLE_TRANSLATION_VALID_MASK,
+                                IOMMU_DEV_TABLE_TRANSLATION_VALID_SHIFT) &&
+         (cpu_has_cx16 || (flags & SET_ROOT_WITH_UNITY_MAP)) )
+    {
+        union {
+            uint32_t dte[4];
+            uint64_t raw64[2];
+            __uint128_t raw128;
+        } ldte;
+        __uint128_t old;
+        int ret = 0;
+
+        memcpy(ldte.dte, dte, sizeof(ldte));
+        old = ldte.raw128;
+
+        set_field_in_reg_u32(domain_id, ldte.dte[2],
+                             IOMMU_DEV_TABLE_DOMAIN_ID_MASK,
+                             IOMMU_DEV_TABLE_DOMAIN_ID_SHIFT, &ldte.dte[2]);
+
+        set_field_in_reg_u32(addr_hi, ldte.dte[1],
+                             IOMMU_DEV_TABLE_PAGE_TABLE_PTR_HIGH_MASK,
+                             IOMMU_DEV_TABLE_PAGE_TABLE_PTR_HIGH_SHIFT,
+                             &ldte.dte[1]);
+        set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, ldte.dte[1],
+                             IOMMU_DEV_TABLE_IO_WRITE_PERMISSION_MASK,
+                             IOMMU_DEV_TABLE_IO_WRITE_PERMISSION_SHIFT,
+                             &ldte.dte[1]);
+        set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, ldte.dte[1],
+                             IOMMU_DEV_TABLE_IO_READ_PERMISSION_MASK,
+                             IOMMU_DEV_TABLE_IO_READ_PERMISSION_SHIFT,
+                             &ldte.dte[1]);
+
+        set_field_in_reg_u32(addr_lo >> PAGE_SHIFT, ldte.dte[0],
+                             IOMMU_DEV_TABLE_PAGE_TABLE_PTR_LOW_MASK,
+                             IOMMU_DEV_TABLE_PAGE_TABLE_PTR_LOW_SHIFT,
+                             &ldte.dte[0]);
+        set_field_in_reg_u32(paging_mode, ldte.dte[0],
+                             IOMMU_DEV_TABLE_PAGING_MODE_MASK,
+                             IOMMU_DEV_TABLE_PAGING_MODE_SHIFT, &ldte.dte[0]);
+        set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, ldte.dte[0],
+                             IOMMU_DEV_TABLE_TRANSLATION_VALID_MASK,
+                             IOMMU_DEV_TABLE_TRANSLATION_VALID_SHIFT,
+                             &ldte.dte[0]);
+        set_field_in_reg_u32(valid ? IOMMU_CONTROL_ENABLED
+                                   : IOMMU_CONTROL_DISABLED,
+                             ldte.dte[0], IOMMU_DEV_TABLE_VALID_MASK,
+                             IOMMU_DEV_TABLE_VALID_SHIFT, &ldte.dte[0]);
+
+        if ( cpu_has_cx16 )
+        {
+            __uint128_t res = cmpxchg16b(dte, &old, &ldte.raw128);
+
+            /*
+             * Hardware does not update the DTE behind our backs, so the
+             * return value should match "old".
+             */
+            if ( res != old )
+            {
+                printk(XENLOG_ERR
+                       "Dom%d: unexpected DTE %016lx_%016lx (expected %016lx_%016lx)\n",
+                       domain_id,
+                       (uint64_t)(res >> 64), (uint64_t)res,
+                       (uint64_t)(old >> 64), (uint64_t)old);
+                ret = -EILSEQ;
+            }
+        }
+        else /* Best effort, updating domain_id last. */
+        {
+            uint64_t *ptr = (void *)dte;
+
+            write_atomic(ptr + 0, ldte.raw64[0]);
+            /* No barrier should be needed between these two. */
+            write_atomic(ptr + 1, ldte.raw64[1]);
+
+            ret = 1;
+        }
+
+        return ret;
+    }
+
     if ( valid ||
          get_field_from_reg_u32(dte0, IOMMU_DEV_TABLE_VALID_MASK,
                                 IOMMU_DEV_TABLE_VALID_SHIFT) )
@@ -183,9 +276,6 @@ void amd_iommu_set_root_page_table(uint3
                          IOMMU_DEV_TABLE_DOMAIN_ID_SHIFT, &entry);
     dte[2] = entry;
 
-    addr_lo = root_ptr & DMA_32BIT_MASK;
-    addr_hi = root_ptr >> 32;
-
     set_field_in_reg_u32(addr_hi, 0,
                          IOMMU_DEV_TABLE_PAGE_TABLE_PTR_HIGH_MASK,
                          IOMMU_DEV_TABLE_PAGE_TABLE_PTR_HIGH_SHIFT, &entry);
@@ -212,6 +302,20 @@ void amd_iommu_set_root_page_table(uint3
                          IOMMU_DEV_TABLE_VALID_MASK,
                          IOMMU_DEV_TABLE_VALID_SHIFT, &entry);
     write_atomic(&dte[0], entry);
+
+    return 0;
+}
+
+paddr_t amd_iommu_get_root_page_table(const uint32_t *dte)
+{
+    uint32_t lo = get_field_from_reg_u32(
+                      dte[0], IOMMU_DEV_TABLE_PAGE_TABLE_PTR_LOW_MASK,
+                      IOMMU_DEV_TABLE_PAGE_TABLE_PTR_LOW_SHIFT);
+    uint32_t hi = get_field_from_reg_u32(
+                      dte[1], IOMMU_DEV_TABLE_PAGE_TABLE_PTR_HIGH_MASK,
+                      IOMMU_DEV_TABLE_PAGE_TABLE_PTR_HIGH_SHIFT);
+
+    return ((paddr_t)hi << 32) | (lo << PAGE_SHIFT);
 }
 
 void iommu_dte_set_iotlb(uint32_t *dte, uint8_t i)
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -109,22 +109,60 @@ static void disable_translation(u32 *dte
     dte[0] = entry;
 }
 
-static void amd_iommu_setup_domain_device(
+static int __must_check allocate_domain_resources(struct domain_iommu *hd)
+{
+    int rc;
+
+    spin_lock(&hd->arch.mapping_lock);
+    rc = amd_iommu_alloc_root(hd);
+    spin_unlock(&hd->arch.mapping_lock);
+
+    return rc;
+}
+
+static bool any_pdev_behind_iommu(const struct domain *d,
+                                  const struct pci_dev *exclude,
+                                  const struct amd_iommu *iommu)
+{
+    const struct pci_dev *pdev;
+
+    for_each_pdev ( d, pdev )
+    {
+        if ( pdev == exclude )
+            continue;
+
+        if ( find_iommu_for_device(pdev->seg,
+                                   PCI_BDF2(pdev->bus, pdev->devfn)) == iommu )
+            return true;
+    }
+
+    return false;
+}
+
+static int __must_check amd_iommu_setup_domain_device(
     struct domain *domain, struct amd_iommu *iommu,
     u8 devfn, struct pci_dev *pdev)
 {
-    void *dte;
+    uint32_t *dte;
     unsigned long flags;
-    int req_id, valid = 1;
-    int dte_i = 0;
+    unsigned int req_id, sr_flags;
+    int dte_i = 0, rc;
     u8 bus = pdev->bus;
-    const struct domain_iommu *hd = dom_iommu(domain);
+    struct domain_iommu *hd = dom_iommu(domain);
+    const struct ivrs_mappings *ivrs_dev;
+
+    BUG_ON(!hd->arch.paging_mode || !iommu->dev_table.buffer);
 
-    BUG_ON( !hd->arch.root_table || !hd->arch.paging_mode ||
-            !iommu->dev_table.buffer );
+    rc = allocate_domain_resources(hd);
+    if ( rc )
+        return rc;
 
-    if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
-        valid = 0;
+    req_id = get_dma_requestor_id(iommu->seg,
+                                  PCI_BDF2(pdev->bus, pdev->devfn));
+    ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
+    sr_flags = (iommu_hwdom_passthrough && is_hardware_domain(domain)
+                ? 0 : SET_ROOT_VALID)
+               | (ivrs_dev->unity_map ? SET_ROOT_WITH_UNITY_MAP : 0);
 
     if ( ats_enabled )
         dte_i = 1;
@@ -132,32 +170,87 @@ static void amd_iommu_setup_domain_devic
     /* get device-table entry */
     req_id = get_dma_requestor_id(iommu->seg, PCI_BDF2(bus, devfn));
     dte = iommu->dev_table.buffer + (req_id * IOMMU_DEV_TABLE_ENTRY_SIZE);
+    ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
 
     spin_lock_irqsave(&iommu->lock, flags);
 
     if ( !is_translation_valid((u32 *)dte) )
     {
         /* bind DTE to domain page-tables */
-        amd_iommu_set_root_page_table(
-            (u32 *)dte, page_to_maddr(hd->arch.root_table), domain->domain_id,
-            hd->arch.paging_mode, valid);
+        rc = amd_iommu_set_root_page_table(
+                 dte, page_to_maddr(hd->arch.root_table),
+                 domain->domain_id, hd->arch.paging_mode, sr_flags);
+        if ( rc )
+        {
+            ASSERT(rc < 0);
+            spin_unlock_irqrestore(&iommu->lock, flags);
+            return rc;
+        }
 
         if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
              iommu_has_cap(iommu, PCI_CAP_IOTLB_SHIFT) )
             iommu_dte_set_iotlb((u32 *)dte, dte_i);
 
         amd_iommu_flush_device(iommu, req_id);
+    }
+    else if ( amd_iommu_get_root_page_table(dte) !=
+              page_to_maddr(hd->arch.root_table) )
+    {
+        /*
+         * Strictly speaking if the device is the only one with this requestor
+         * ID, it could be allowed to be re-assigned regardless of unity map
+         * presence.  But let's deal with that case only if it is actually
+         * found in the wild.
+         */
+        if ( req_id != PCI_BDF2(bus, devfn) &&
+             (sr_flags & SET_ROOT_WITH_UNITY_MAP) )
+            rc = -EOPNOTSUPP;
+        else
+            rc = amd_iommu_set_root_page_table(
+                     dte, page_to_maddr(hd->arch.root_table),
+                     domain->domain_id, hd->arch.paging_mode, sr_flags);
+        if ( rc < 0 )
+        {
+            spin_unlock_irqrestore(&iommu->lock, flags);
+            return rc;
+        }
+        if ( rc &&
+             domain != pdev->domain &&
+             /*
+              * By non-atomically updating the DTE's domain ID field last,
+              * during a short window in time TLB entries with the old domain
+              * ID but the new page tables may have been inserted.  This could
+              * affect I/O of other devices using this same (old) domain ID.
+              * Such updating therefore is not a problem if this was the only
+              * device associated with the old domain ID.  Diverting I/O of any
+              * of a dying domain's devices to the quarantine page tables is
+              * intended anyway.
+              */
+             !pdev->domain->is_dying &&
+             (any_pdev_behind_iommu(pdev->domain, pdev, iommu) ||
+              pdev->phantom_stride) )
+            printk(" %04x:%02x:%02x.%u: reassignment may cause %pd data corruption\n",
+                   pdev->seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                   pdev->domain);
 
-        AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
-                        "root table = %#"PRIx64", "
-                        "domain = %d, paging mode = %d\n",
-                        req_id, pdev->type,
-                        page_to_maddr(hd->arch.root_table),
-                        domain->domain_id, hd->arch.paging_mode);
+        if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
+             iommu_has_cap(iommu, PCI_CAP_IOTLB_SHIFT) )
+            ASSERT(get_field_from_reg_u32(
+                       dte[3], IOMMU_DEV_TABLE_IOTLB_SUPPORT_MASK,
+                       IOMMU_DEV_TABLE_IOTLB_SUPPORT_SHIFT) == dte_i);
+
+        amd_iommu_flush_device(iommu, req_id);
     }
 
     spin_unlock_irqrestore(&iommu->lock, flags);
 
+    AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
+                    "root table = %#"PRIx64", "
+                    "domain = %d, paging mode = %d\n",
+                    req_id, pdev->type,
+                    page_to_maddr(hd->arch.root_table),
+                    domain->domain_id, hd->arch.paging_mode);
+
     ASSERT(pcidevs_locked());
 
     if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
@@ -168,6 +261,8 @@ static void amd_iommu_setup_domain_devic
 
         amd_iommu_flush_iotlb(devfn, pdev, INV_IOMMU_ALL_PAGES_ADDRESS, 0);
     }
+
+    return 0;
 }
 
 int __init amd_iov_detect(void)
@@ -211,17 +306,6 @@ int amd_iommu_alloc_root(struct domain_i
     return 0;
 }
 
-static int __must_check allocate_domain_resources(struct domain_iommu *hd)
-{
-    int rc;
-
-    spin_lock(&hd->arch.mapping_lock);
-    rc = amd_iommu_alloc_root(hd);
-    spin_unlock(&hd->arch.mapping_lock);
-
-    return rc;
-}
-
 int __read_mostly amd_iommu_min_paging_mode = 1;
 
 static int amd_iommu_domain_init(struct domain *d)
@@ -310,7 +394,6 @@ static int reassign_device(struct domain
 {
     struct amd_iommu *iommu;
     int bdf, rc;
-    struct domain_iommu *t = dom_iommu(target);
     const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
 
     bdf = PCI_BDF2(pdev->bus, pdev->devfn);
@@ -324,7 +407,15 @@ static int reassign_device(struct domain
         return -ENODEV;
     }
 
-    amd_iommu_disable_domain_device(source, iommu, devfn, pdev);
+    rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
+    if ( rc )
+        return rc;
+
+    if ( devfn == pdev->devfn && pdev->domain != target )
+    {
+        list_move(&pdev->domain_list, &target->arch.pdev_list);
+        pdev->domain = target;
+    }
 
     /*
      * If the device belongs to the hardware domain, and it has a unity mapping,
@@ -340,27 +431,10 @@ static int reassign_device(struct domain
             return rc;
     }
 
-    if ( devfn == pdev->devfn && pdev->domain != dom_io )
-    {
-        list_move(&pdev->domain_list, &dom_io->arch.pdev_list);
-        pdev->domain = dom_io;
-    }
-
-    rc = allocate_domain_resources(t);
-    if ( rc )
-        return rc;
-
-    amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
     AMD_IOMMU_DEBUG("Re-assign %04x:%02x:%02x.%u from dom%d to dom%d\n",
                     pdev->seg, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                     source->domain_id, target->domain_id);
 
-    if ( devfn == pdev->devfn && pdev->domain != target )
-    {
-        list_move(&pdev->domain_list, &target->arch.pdev_list);
-        pdev->domain = target;
-    }
-
     return 0;
 }
 
@@ -491,8 +565,7 @@ static int amd_iommu_add_device(u8 devfn
         return -ENODEV;
     }
 
-    amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
-    return 0;
+    return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
 }
 
 static int amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: prepare for per-device quarantine page tables (part I)

Arrange for domain ID and page table root to be passed around, the latter in
particular to domain_pgd_maddr() such that taking it from the per-domain
fields can be overridden.

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
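
A minimal sketch of the override this enables (illustrative only;
struct domain_iommu_stub and effective_root() are made-up stand-ins, not
Xen types): a root handed in by the caller takes precedence, and only when
none is supplied does the per-domain root get used, which is what later
patches in this series rely on to substitute a per-device root.

    #include <stdint.h>

    typedef uint64_t paddr_t;

    /* Stand-in for the relevant part of struct domain_iommu. */
    struct domain_iommu_stub {
        paddr_t pgd_maddr;
    };

    paddr_t effective_root(paddr_t passed_in, const struct domain_iommu_stub *hd)
    {
        /* An explicitly passed root wins; otherwise use the domain's. */
        return passed_in ? passed_in : hd->pgd_maddr;
    }
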

--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -72,9 +72,10 @@ void *map_vtd_domain_page(u64 maddr);
 void unmap_vtd_domain_page(void *va);
 int domain_context_mapping_one(struct domain *domain, struct iommu *iommu,
                                uint8_t bus, uint8_t devfn,
-                               const struct pci_dev *pdev, unsigned int mode);
+                               const struct pci_dev *pdev, domid_t domid,
+                               paddr_t pgd_maddr, unsigned int mode);
 int domain_context_unmap_one(struct domain *domain, struct iommu *iommu,
-                             u8 bus, u8 devfn);
+                             uint8_t bus, uint8_t devfn, domid_t domid);
 int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
 
 unsigned int io_apic_read_remap_rte(unsigned int apic, unsigned int reg);
@@ -93,7 +94,8 @@ void platform_quirks_init(void);
 void vtd_ops_preamble_quirk(struct iommu* iommu);
 void vtd_ops_postamble_quirk(struct iommu* iommu);
 int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus,
-                               uint8_t devfn, unsigned int mode);
+                               uint8_t devfn, domid_t domid, paddr_t pgd_maddr,
+                               unsigned int mode);
 void pci_vtd_quirk(const struct pci_dev *);
 void quirk_iommu_caps(struct iommu *iommu);
 
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1405,12 +1405,12 @@ int domain_context_mapping_one(
     struct domain *domain,
     struct iommu *iommu,
     uint8_t bus, uint8_t devfn, const struct pci_dev *pdev,
-    unsigned int mode)
+    domid_t domid, paddr_t pgd_maddr, unsigned int mode)
 {
     struct domain_iommu *hd = dom_iommu(domain);
     struct context_entry *context, *context_entries, lctxt;
     __uint128_t old;
-    u64 maddr, pgd_maddr;
+    uint64_t maddr;
     uint16_t seg = iommu->intel->drhd->segment, prev_did = 0;
     struct domain *prev_dom = NULL;
     int agaw, rc, ret;
@@ -1451,10 +1451,12 @@ int domain_context_mapping_one(
     }
     else
     {
+        paddr_t root = pgd_maddr;
+
         spin_lock(&hd->arch.mapping_lock);
 
         /* Ensure we have pagetables allocated down to leaf PTE. */
-        if ( hd->arch.pgd_maddr == 0 )
+        if ( !root )
         {
             addr_to_dma_page_maddr(domain, 0, 1);
             if ( hd->arch.pgd_maddr == 0 )
@@ -1467,22 +1469,24 @@ int domain_context_mapping_one(
                     rcu_unlock_domain(prev_dom);
                 return -ENOMEM;
             }
+
+            root = hd->arch.pgd_maddr;
         }
 
         /* Skip top levels of page tables for 2- and 3-level DRHDs. */
-        pgd_maddr = hd->arch.pgd_maddr;
         for ( agaw = level_to_agaw(4);
               agaw != level_to_agaw(iommu->nr_pt_levels);
               agaw-- )
         {
-            struct dma_pte *p = map_vtd_domain_page(pgd_maddr);
-            pgd_maddr = dma_pte_addr(*p);
+            struct dma_pte *p = map_vtd_domain_page(root);
+
+            root = dma_pte_addr(*p);
             unmap_vtd_domain_page(p);
-            if ( pgd_maddr == 0 )
+            if ( !root )
                 goto nomem;
         }
 
-        context_set_address_root(lctxt, pgd_maddr);
+        context_set_address_root(lctxt, root);
         if ( ats_enabled && ecap_dev_iotlb(iommu->ecap) )
             context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB);
         else
@@ -1598,15 +1602,21 @@ int domain_context_mapping_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !seg && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, mode);
+        rc = me_wifi_quirk(domain, bus, devfn, domid, pgd_maddr, mode);
 
     if ( rc )
     {
         if ( !prev_dom )
-            domain_context_unmap_one(domain, iommu, bus, devfn);
+            domain_context_unmap_one(domain, iommu, bus, devfn,
+                                     domain->domain_id);
         else if ( prev_dom != domain ) /* Avoid infinite recursion. */
+        {
+            hd = dom_iommu(prev_dom);
             domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
+                                       domain->domain_id,
+                                       hd->arch.pgd_maddr,
                                        mode & MAP_WITH_RMRR);
+        }
     }
 
     if ( prev_dom )
@@ -1623,6 +1633,7 @@ static int domain_context_mapping(struct
 {
     struct acpi_drhd_unit *drhd;
     const struct acpi_rmrr_unit *rmrr;
+    paddr_t pgd_maddr = dom_iommu(domain)->arch.pgd_maddr;
     int ret = 0;
     unsigned int i, mode = 0;
     uint16_t seg = pdev->seg, bdf;
@@ -1678,7 +1689,8 @@ static int domain_context_mapping(struct
                    domain->domain_id, seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, mode);
+                                         pdev, domain->domain_id, pgd_maddr,
+                                         mode);
         if ( ret > 0 )
             ret = 0;
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
@@ -1693,7 +1705,8 @@ static int domain_context_mapping(struct
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, mode);
+                                         pdev, domain->domain_id, pgd_maddr,
+                                         mode);
         if ( ret < 0 )
             break;
         prev_present = ret;
@@ -1719,7 +1732,8 @@ static int domain_context_mapping(struct
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                             NULL, mode);
+                                             NULL, domain->domain_id, pgd_maddr,
+                                             mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1734,7 +1748,8 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL, mode);
+                                             NULL, domain->domain_id, pgd_maddr,
+                                             mode);
 
         if ( ret )
         {
@@ -1763,7 +1778,7 @@ static int domain_context_mapping(struct
 int domain_context_unmap_one(
     struct domain *domain,
     struct iommu *iommu,
-    u8 bus, u8 devfn)
+    uint8_t bus, uint8_t devfn, domid_t domid)
 {
     struct context_entry *context, *context_entries;
     u64 maddr;
@@ -1821,7 +1836,7 @@ int domain_context_unmap_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !iommu->intel->drhd->segment && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, UNMAP_ME_PHANTOM_FUNC);
+        rc = me_wifi_quirk(domain, bus, devfn, domid, 0, UNMAP_ME_PHANTOM_FUNC);
 
     return rc;
 }
@@ -1860,7 +1875,8 @@ static int domain_context_unmap(struct d
             printk(VTDPREFIX "d%d:PCIe: unmap %04x:%02x:%02x.%u\n",
                    domain->domain_id, seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+                                       domain->domain_id);
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             disable_ats_device(pdev);
 
@@ -1870,7 +1886,8 @@ static int domain_context_unmap(struct d
         if ( iommu_debug )
             printk(VTDPREFIX "d%d:PCI: unmap %04x:%02x:%02x.%u\n",
                    domain->domain_id, seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+                                       domain->domain_id);
         if ( ret )
             break;
 
@@ -1882,14 +1899,17 @@ static int domain_context_unmap(struct d
         /* PCIe to PCI/PCIx bridge */
         if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
         {
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);
+            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                           domain->domain_id);
             if ( ret )
                 return ret;
 
-            ret = domain_context_unmap_one(domain, iommu, secbus, 0);
+            ret = domain_context_unmap_one(domain, iommu, secbus, 0,
+                                           domain->domain_id);
         }
         else /* Legacy PCI bridge */
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);
+            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                           domain->domain_id);
 
         break;
 
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -331,6 +331,8 @@ void __init platform_quirks_init(void)
 
 static int __must_check map_me_phantom_function(struct domain *domain,
                                                 unsigned int dev,
+                                                domid_t domid,
+                                                paddr_t pgd_maddr,
                                                 unsigned int mode)
 {
     struct acpi_drhd_unit *drhd;
@@ -344,16 +346,17 @@ static int __must_check map_me_phantom_f
     /* map or unmap ME phantom function */
     if ( !(mode & UNMAP_ME_PHANTOM_FUNC) )
         rc = domain_context_mapping_one(domain, drhd->iommu, 0,
-                                        PCI_DEVFN(dev, 7), NULL, mode);
+                                        PCI_DEVFN(dev, 7), NULL,
+                                        domid, pgd_maddr, mode);
     else
         rc = domain_context_unmap_one(domain, drhd->iommu, 0,
-                                      PCI_DEVFN(dev, 7));
+                                      PCI_DEVFN(dev, 7), domid);
 
     return rc;
 }
 
 int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn,
-                  unsigned int mode)
+                  domid_t domid, paddr_t pgd_maddr, unsigned int mode)
 {
     u32 id;
     int rc = 0;
@@ -377,7 +380,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x423b8086:
             case 0x423c8086:
             case 0x423d8086:
-                rc = map_me_phantom_function(domain, 3, mode);
+                rc = map_me_phantom_function(domain, 3, domid, pgd_maddr, mode);
                 break;
             default:
                 break;
@@ -403,7 +406,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x42388086:        /* Puma Peak */
             case 0x422b8086:
             case 0x422c8086:
-                rc = map_me_phantom_function(domain, 22, mode);
+                rc = map_me_phantom_function(domain, 22, domid, pgd_maddr, mode);
                 break;
             default:
                 break;
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: prepare for per-device quarantine page tables (part II)

Replace the passing of struct domain * with domid_t, in preparation for
per-device quarantine page tables also requiring per-device pseudo
domain IDs, which aren't going to be associated with any struct domain
instances.

No functional change intended (except for slightly adjusted log message
text).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -52,8 +52,8 @@ static struct tasklet vtd_fault_tasklet;
 static int setup_hwdom_device(u8 devfn, struct pci_dev *);
 static void setup_hwdom_rmrr(struct domain *d);
 
-static int domain_iommu_domid(struct domain *d,
-                              struct iommu *iommu)
+static int get_iommu_did(domid_t domid, const struct iommu *iommu,
+                         bool warn)
 {
     unsigned long nr_dom, i;
 
@@ -61,23 +61,24 @@ static int domain_iommu_domid(struct dom
     i = find_first_bit(iommu->domid_bitmap, nr_dom);
     while ( i < nr_dom )
     {
-        if ( iommu->domid_map[i] == d->domain_id )
+        if ( iommu->domid_map[i] == domid )
             return i;
 
         i = find_next_bit(iommu->domid_bitmap, nr_dom, i+1);
     }
 
-    dprintk(XENLOG_ERR VTDPREFIX,
-            "Cannot get valid iommu domid: domid=%d iommu->index=%d\n",
-            d->domain_id, iommu->index);
+    if ( warn )
+        dprintk(XENLOG_ERR VTDPREFIX,
+                "No valid iommu %u domid for Dom%d\n",
+                iommu->index, domid);
+
     return -1;
 }
 
 #define DID_FIELD_WIDTH 16
 #define DID_HIGH_OFFSET 8
 static int context_set_domain_id(struct context_entry *context,
-                                 struct domain *d,
-                                 struct iommu *iommu)
+                                 domid_t domid, struct iommu *iommu)
 {
     unsigned long nr_dom, i;
     int found = 0;
@@ -88,7 +89,7 @@ static int context_set_domain_id(struct
     i = find_first_bit(iommu->domid_bitmap, nr_dom);
     while ( i < nr_dom )
     {
-        if ( iommu->domid_map[i] == d->domain_id )
+        if ( iommu->domid_map[i] == domid )
         {
             found = 1;
             break;
@@ -104,7 +105,7 @@ static int context_set_domain_id(struct
             dprintk(XENLOG_ERR VTDPREFIX, "IOMMU: no free domain ids\n");
             return -EFAULT;
         }
-        iommu->domid_map[i] = d->domain_id;
+        iommu->domid_map[i] = domid;
     }
 
     set_bit(i, iommu->domid_bitmap);
@@ -131,9 +132,9 @@ static void __init free_intel_iommu(stru
     xfree(intel);
 }
 
-static void cleanup_domid_map(struct domain *domain, struct iommu *iommu)
+static void cleanup_domid_map(domid_t domid, struct iommu *iommu)
 {
-    int iommu_domid = domain_iommu_domid(domain, iommu);
+    int iommu_domid = get_iommu_did(domid, iommu, false);
 
     if ( iommu_domid >= 0 )
     {
@@ -189,7 +190,7 @@ static void check_cleanup_domid_map(stru
     if ( !found )
     {
         clear_bit(iommu->index, &dom_iommu(d)->arch.iommu_bitmap);
-        cleanup_domid_map(d, iommu);
+        cleanup_domid_map(d->domain_id, iommu);
     }
 }
 
@@ -670,7 +671,7 @@ static int __must_check iommu_flush_iotl
             continue;
 
         flush_dev_iotlb = !!find_ats_dev_drhd(iommu);
-        iommu_domid= domain_iommu_domid(d, iommu);
+        iommu_domid = get_iommu_did(d->domain_id, iommu, !d->is_dying);
         if ( iommu_domid == -1 )
             continue;
 
@@ -1495,7 +1496,7 @@ int domain_context_mapping_one(
         spin_unlock(&hd->arch.mapping_lock);
     }
 
-    rc = context_set_domain_id(&lctxt, domain, iommu);
+    rc = context_set_domain_id(&lctxt, domid, iommu);
     if ( rc )
     {
     unlock:
@@ -1803,7 +1804,7 @@ int domain_context_unmap_one(
     context_clear_entry(*context);
     iommu_sync_cache(context, sizeof(struct context_entry));
 
-    iommu_domid= domain_iommu_domid(domain, iommu);
+    iommu_domid = get_iommu_did(domid, iommu, !domain->is_dying);
     if ( iommu_domid == -1 )
     {
         spin_unlock(&iommu->lock);
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: maintain a per-device pseudo domain ID

In order to subsequently enable per-device quarantine page tables, we'll
need domain-ID-like identifiers to be inserted in the respective device
(AMD) or context (Intel) table entries alongside the per-device page
table root addresses.

Make use of "real" domain IDs occupying only half of the value range
coverable by domid_t.

Note that in VT-d's iommu_alloc() I didn't want to introduce new memory
leaks in case of error, but existing ones don't get plugged - that'll be
the subject of a later change.

The VT-d changes are slightly asymmetric, but this way we can avoid
assigning pseudo domain IDs to devices which would never be mapped,
while still not having to add a new parameter to domain_context_unmap().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
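
To make the value-range split concrete (a standalone sketch, not part of
the patch): all real domain IDs fit within DOMID_MASK, so a per-device
pseudo ID is simply a bitmap slot index OR-ed into the upper half of the
16-bit space, as iommu_alloc_domid() in the hunks below does.

    #include <stdint.h>
    #include <stdio.h>

    #define DOMID_MASK 0x7fffu   /* as added to public/xen.h by this patch */

    int main(void)
    {
        unsigned int idx = 5;                      /* example bitmap slot */
        uint16_t pseudo = idx | (DOMID_MASK + 1);  /* 0x8005 */

        printf("real IDs end at %#x; pseudo ID for slot %u is %#x\n",
               DOMID_MASK, idx, pseudo);
        return 0;
    }
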

--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -112,6 +112,10 @@ int pi_update_irte(const struct pi_desc
         ops->sync_cache(addr, size);                    \
 })
 
+unsigned long *iommu_init_domid(void);
+domid_t iommu_alloc_domid(unsigned long *map);
+void iommu_free_domid(domid_t domid, unsigned long *map);
+
 #endif /* !__ARCH_X86_IOMMU_H__ */
 /*
  * Local variables:
--- a/xen/include/asm-x86/pci.h
+++ b/xen/include/asm-x86/pci.h
@@ -15,6 +15,12 @@
 
 struct arch_pci_dev {
     vmask_t used_vectors;
+    /*
+     * These fields are (de)initialized under pcidevs-lock. Other uses of
+     * them don't race (de)initialization and hence don't strictly need any
+     * locking.
+     */
+    domid_t pseudo_domid;
 };
 
 int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
--- a/xen/include/asm-x86/amd-iommu.h
+++ b/xen/include/asm-x86/amd-iommu.h
@@ -97,6 +97,7 @@ struct amd_iommu {
     struct ring_buffer cmd_buffer;
     struct ring_buffer event_log;
     struct ring_buffer ppr_log;
+    unsigned long *domid_map;
 
     int exclusion_enable;
     int exclusion_allow_all;
--- a/xen/drivers/passthrough/amd/iommu_detect.c
+++ b/xen/drivers/passthrough/amd/iommu_detect.c
@@ -150,6 +150,11 @@ int __init amd_iommu_detect_one_acpi(
     if ( rt )
         goto out;
 
+    iommu->domid_map = iommu_init_domid();
+    rt = -ENOMEM;
+    if ( !iommu->domid_map )
+        goto out;
+
     rt = pci_ro_device(iommu->seg, bus, PCI_DEVFN(dev, func));
     if ( rt )
         printk(XENLOG_ERR
@@ -161,7 +166,10 @@ int __init amd_iommu_detect_one_acpi(
 
  out:
     if ( rt )
+    {
+        xfree(iommu->domid_map);
         xfree(iommu);
+    }
 
     return rt;
 }
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -541,6 +541,8 @@ static int amd_iommu_add_device(u8 devfn
 {
     struct amd_iommu *iommu;
     u16 bdf;
+    bool fresh_domid = false;
+    int ret;
 
     if ( !pdev->domain )
         return -EINVAL;
@@ -565,7 +567,22 @@ static int amd_iommu_add_device(u8 devfn
         return -ENODEV;
     }
 
-    return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
+    if ( iommu_quarantine && pdev->arch.pseudo_domid == DOMID_INVALID )
+    {
+        pdev->arch.pseudo_domid = iommu_alloc_domid(iommu->domid_map);
+        if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+            return -ENOSPC;
+        fresh_domid = true;
+    }
+
+    ret = amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
+    if ( ret && fresh_domid )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
+    return ret;
 }
 
 static int amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
@@ -587,6 +604,10 @@ static int amd_iommu_remove_device(u8 de
     }
 
     amd_iommu_disable_domain_device(pdev->domain, iommu, devfn, pdev);
+
+    iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
+    pdev->arch.pseudo_domid = DOMID_INVALID;
+
     return 0;
 }
 
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -343,6 +343,7 @@ static struct pci_dev *alloc_pdev(struct
     *((u8*) &pdev->bus) = bus;
     *((u8*) &pdev->devfn) = devfn;
     pdev->domain = NULL;
+    pdev->arch.pseudo_domid = DOMID_INVALID;
     INIT_LIST_HEAD(&pdev->msi_list);
 
     if ( pci_find_cap_offset(pseg->nr, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
@@ -1311,10 +1312,13 @@ static int _dump_pci_devices(struct pci_
 
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
     {
-        printk("%04x:%02x:%02x.%u - dom %-3d - node %-3d - MSIs < ",
-               pseg->nr, pdev->bus,
-               PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
-               pdev->domain ? pdev->domain->domain_id : -1,
+        printk("%04x:%02x:%02x.%u - ", pseg->nr, pdev->bus,
+               PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
+        if ( pdev->domain == dom_io )
+            printk("DomIO:%x", pdev->arch.pseudo_domid);
+        else if ( pdev->domain )
+            printk("Dom%d", pdev->domain->domain_id);
+        printk(" - node %-3d - MSIs < ",
                (pdev->node != NUMA_NO_NODE) ? pdev->node : -1);
         list_for_each_entry ( msi, &pdev->msi_list, list )
                printk("%d ", msi->irq);
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -22,6 +22,7 @@
 #include <xen/sched.h>
 #include <xen/xmalloc.h>
 #include <xen/domain_page.h>
+#include <xen/err.h>
 #include <xen/iocap.h>
 #include <xen/iommu.h>
 #include <xen/numa.h>
@@ -1234,7 +1235,7 @@ int __init iommu_alloc(struct acpi_drhd_
 {
     struct iommu *iommu;
     unsigned long sagaw, nr_dom;
-    int agaw;
+    int agaw, rc;
 
     if ( nr_iommus > MAX_IOMMUS )
     {
@@ -1324,10 +1325,19 @@ int __init iommu_alloc(struct acpi_drhd_
     if ( !iommu->domid_map )
         return -ENOMEM ;
 
+    iommu->pseudo_domid_map = iommu_init_domid();
+    rc = -ENOMEM;
+    if ( !iommu->pseudo_domid_map )
+        goto free;
+
     spin_lock_init(&iommu->lock);
     spin_lock_init(&iommu->register_lock);
 
     return 0;
+
+ free:
+    iommu_free(drhd);
+    return rc;
 }
 
 void __init iommu_free(struct acpi_drhd_unit *drhd)
@@ -1350,6 +1360,7 @@ void __init iommu_free(struct acpi_drhd_
 
     xfree(iommu->domid_bitmap);
     xfree(iommu->domid_map);
+    xfree(iommu->pseudo_domid_map);
 
     free_intel_iommu(iommu->intel);
     if ( iommu->msi.irq >= 0 )
@@ -1626,8 +1637,8 @@ int domain_context_mapping_one(
     return rc ?: pdev && prev_dom;
 }
 
-static int domain_context_unmap(struct domain *d, uint8_t devfn,
-                                struct pci_dev *pdev);
+static const struct acpi_drhd_unit *domain_context_unmap(
+    struct domain *d, uint8_t devfn, struct pci_dev *pdev);
 
 static int domain_context_mapping(struct domain *domain, u8 devfn,
                                   struct pci_dev *pdev)
@@ -1635,6 +1646,7 @@ static int domain_context_mapping(struct
     struct acpi_drhd_unit *drhd;
     const struct acpi_rmrr_unit *rmrr;
     paddr_t pgd_maddr = dom_iommu(domain)->arch.pgd_maddr;
+    domid_t orig_domid = pdev->arch.pseudo_domid;
     int ret = 0;
     unsigned int i, mode = 0;
     uint16_t seg = pdev->seg, bdf;
@@ -1685,6 +1697,14 @@ static int domain_context_mapping(struct
         break;
 
     case DEV_TYPE_PCIe_ENDPOINT:
+        if ( iommu_quarantine && orig_domid == DOMID_INVALID )
+        {
+            pdev->arch.pseudo_domid =
+                iommu_alloc_domid(drhd->iommu->pseudo_domid_map);
+            if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+                return -ENOSPC;
+        }
+
         if ( iommu_debug )
             printk(VTDPREFIX "d%d:PCIe: map %04x:%02x:%02x.%u\n",
                    domain->domain_id, seg, bus,
@@ -1700,6 +1720,14 @@ static int domain_context_mapping(struct
         break;
 
     case DEV_TYPE_PCI:
+        if ( iommu_quarantine && orig_domid == DOMID_INVALID )
+        {
+            pdev->arch.pseudo_domid =
+                iommu_alloc_domid(drhd->iommu->pseudo_domid_map);
+            if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+                return -ENOSPC;
+        }
+
         if ( iommu_debug )
             printk(VTDPREFIX "d%d:PCI: map %04x:%02x:%02x.%u\n",
                    domain->domain_id, seg, bus,
@@ -1773,6 +1801,13 @@ static int domain_context_mapping(struct
     if ( !ret && devfn == pdev->devfn )
         pci_vtd_quirk(pdev);
 
+    if ( ret && drhd && orig_domid == DOMID_INVALID )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid,
+                         drhd->iommu->pseudo_domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
     return ret;
 }
 
@@ -1842,8 +1877,10 @@ int domain_context_unmap_one(
     return rc;
 }
 
-static int domain_context_unmap(struct domain *domain, u8 devfn,
-                                struct pci_dev *pdev)
+static const struct acpi_drhd_unit *domain_context_unmap(
+    struct domain *domain,
+    uint8_t devfn,
+    struct pci_dev *pdev)
 {
     struct acpi_drhd_unit *drhd;
     struct iommu *iommu;
@@ -1852,7 +1889,7 @@ static int domain_context_unmap(struct d
 
     drhd = acpi_find_matched_drhd_unit(pdev);
     if ( !drhd )
-        return -ENODEV;
+        return ERR_PTR(-ENODEV);
     iommu = drhd->iommu;
 
     switch ( pdev->type )
@@ -1863,7 +1900,7 @@ static int domain_context_unmap(struct d
                    domain->domain_id, seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
         if ( !is_hardware_domain(domain) )
-            return -EPERM;
+            return ERR_PTR(-EPERM);
         goto out;
 
     case DEV_TYPE_PCIe_BRIDGE:
@@ -1902,11 +1939,9 @@ static int domain_context_unmap(struct d
         {
             ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
                                            domain->domain_id);
-            if ( ret )
-                return ret;
-
-            ret = domain_context_unmap_one(domain, iommu, secbus, 0,
-                                           domain->domain_id);
+            if ( !ret )
+                ret = domain_context_unmap_one(domain, iommu, secbus, 0,
+                                               domain->domain_id);
         }
         else /* Legacy PCI bridge */
             ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
@@ -1926,7 +1961,7 @@ static int domain_context_unmap(struct d
         check_cleanup_domid_map(domain, pdev, iommu);
 
 out:
-    return ret;
+    return ret ? ERR_PTR(ret) : drhd;
 }
 
 static void iommu_domain_teardown(struct domain *d)
@@ -2152,16 +2187,17 @@ static int intel_iommu_enable_device(str
 
 static int intel_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
 {
+    const struct acpi_drhd_unit *drhd;
     struct acpi_rmrr_unit *rmrr;
     u16 bdf;
-    int ret, i;
+    unsigned int i;
 
     if ( !pdev->domain )
         return -EINVAL;
 
-    ret = domain_context_unmap(pdev->domain, devfn, pdev);
-    if ( ret )
-        return ret;
+    drhd = domain_context_unmap(pdev->domain, devfn, pdev);
+    if ( IS_ERR(drhd) )
+        return PTR_ERR(drhd);
 
     for_each_rmrr_device ( rmrr, bdf, i )
     {
@@ -2178,6 +2214,13 @@ static int intel_iommu_remove_device(u8
                                rmrr->end_address, 0);
     }
 
+    if ( drhd )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid,
+                         drhd->iommu->pseudo_domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
     return 0;
 }
 
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -541,6 +541,7 @@ struct iommu {
     struct msi_desc msi;
     struct intel_iommu *intel;
     struct list_head ats_devices;
+    unsigned long *pseudo_domid_map; /* "pseudo" domain id bitmap */
     unsigned long *domid_bitmap;  /* domain id bitmap */
     u16 *domid_map;               /* domain id mapping array */
 };
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -373,6 +373,53 @@ void __hwdom_init arch_iommu_hwdom_init(
         return;
 }
 
+unsigned long *__init iommu_init_domid(void)
+{
+    if ( !iommu_quarantine )
+        return ZERO_BLOCK_PTR;
+
+    BUILD_BUG_ON(DOMID_MASK * 2U >= UINT16_MAX);
+
+    return xzalloc_array(unsigned long,
+                         BITS_TO_LONGS(UINT16_MAX - DOMID_MASK));
+}
+
+domid_t iommu_alloc_domid(unsigned long *map)
+{
+    /*
+     * This is used uniformly across all IOMMUs, such that on typical
+     * systems we wouldn't re-use the same ID very quickly (perhaps never).
+     */
+    static unsigned int start;
+    unsigned int idx = find_next_zero_bit(map, UINT16_MAX - DOMID_MASK, start);
+
+    ASSERT(pcidevs_locked());
+
+    if ( idx >= UINT16_MAX - DOMID_MASK )
+        idx = find_first_zero_bit(map, UINT16_MAX - DOMID_MASK);
+    if ( idx >= UINT16_MAX - DOMID_MASK )
+        return DOMID_INVALID;
+
+    __set_bit(idx, map);
+
+    start = idx + 1;
+
+    return idx | (DOMID_MASK + 1);
+}
+
+void iommu_free_domid(domid_t domid, unsigned long *map)
+{
+    ASSERT(pcidevs_locked());
+
+    if ( domid == DOMID_INVALID )
+        return;
+
+    ASSERT(domid > DOMID_MASK);
+
+    if ( !__test_and_clear_bit(domid & DOMID_MASK, map) )
+        BUG();
+}
+
 /*
  * Local variables:
  * mode: C
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -584,6 +584,9 @@ DEFINE_XEN_GUEST_HANDLE(mmuext_op_t);
 /* Idle domain. */
 #define DOMID_IDLE           xen_mk_uint(0x7FFF)
 
+/* Mask for valid domain id values */
+#define DOMID_MASK           xen_mk_uint(0x7FFF)
+
 #ifndef __ASSEMBLY__
 
 typedef uint16_t domid_t;
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: drop TLB flushes from quarantine_init() hooks

The page tables just created aren't hooked up anywhere yet, so there's
nothing that could be present in any TLB, and hence nothing to flush.
Dropping this flush is, at least on the VT-d side, a prereq to
per-device domain ID use when quarantining devices, as dom_io isn't
going to be assigned a DID anymore: the warning in get_iommu_did()
would otherwise trigger.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -840,8 +840,6 @@ int __init amd_iommu_quarantine_init(str
  out:
     spin_unlock(&hd->arch.mapping_lock);
 
-    amd_iommu_flush_all_pages(d);
-
     /* Pages leaked in failure case */
     return level ? -ENOMEM : 0;
 }
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2858,7 +2858,6 @@ static int __init intel_iommu_quarantine
     struct dma_pte *parent;
     unsigned int agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
     unsigned int level = agaw_to_level(agaw);
-    int rc;
 
     if ( hd->arch.pgd_maddr )
     {
@@ -2905,10 +2904,8 @@ static int __init intel_iommu_quarantine
  out:
     spin_unlock(&hd->arch.mapping_lock);
 
-    rc = iommu_flush_iotlb_all(d);
-
     /* Pages leaked in failure case */
-    return level ? -ENOMEM : rc;
+    return level ? -ENOMEM : 0;
 }
 
 const struct iommu_ops __initconstrel intel_iommu_ops = {
From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: abstract maximum number of page table levels

We will want to use the constant elsewhere.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
@@ -185,7 +185,7 @@ static inline int amd_iommu_get_paging_m
     while ( max_frames > PTE_PER_TABLE_SIZE )
     {
         max_frames = PTE_PER_TABLE_ALIGN(max_frames) >> PTE_PER_TABLE_SHIFT;
-        if ( ++level > 6 )
+        if ( ++level > IOMMU_MAX_PT_LEVELS )
             return -ENOMEM;
     }
 
--- a/xen/include/asm-x86/hvm/svm/amd-iommu-defs.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-defs.h
@@ -114,6 +114,8 @@
 #define IOMMU_DEV_TABLE_PAGE_TABLE_PTR_LOW_MASK		0xFFFFF000
 #define IOMMU_DEV_TABLE_PAGE_TABLE_PTR_LOW_SHIFT	12
 
+#define IOMMU_MAX_PT_LEVELS 6
+
 /* DeviceTable Entry[63:32] */
 #define IOMMU_DEV_TABLE_GV_SHIFT                    23
 #define IOMMU_DEV_TABLE_GV_MASK                     0x800000
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -477,7 +477,7 @@ static int iommu_pde_from_dfn(struct dom
     table = hd->arch.root_table;
     level = hd->arch.paging_mode;
 
-    BUG_ON( table == NULL || level < 1 || level > 6 );
+    BUG_ON( table == NULL || level < 1 || level > IOMMU_MAX_PT_LEVELS );
 
     /*
      * A frame number past what the current page tables can represent can't
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: use per-device page tables for quarantining

Since it is unspecified how/when the memory regions covered by RMRRs /
unity maps may be accessed, devices with such regions may not be left
disconnected from the mappings of these regions (as long as it's not
certain that the device has been fully quiesced).  Hence even the page
tables used when quarantining such devices need to contain mappings of
those regions.  This implies installing page tables in the first place
even when not in scratch-page quarantining mode.

This is CVE-2022-26361 / part of XSA-400.

While for the purpose here it would be sufficient to have devices with
RMRRs / unity mapped regions use per-device page tables, extend this to
all devices (in scratch-page quarantining mode).  This allows the leaf
pages to be mapped r/w, thus also covering memory writes (rather than
just reads) issued by non-quiescent devices.

Set up quarantine page tables as late as possible, yet early enough to
not encounter failure during de-assign. This means setup generally
happens in assign_device(), while (for now) the one in deassign_device()
is there mainly to be on the safe side.

In VT-d's DID allocation function don't require the IOMMU lock to be
held anymore: All involved code paths hold pcidevs_lock, so this way we
avoid the need to acquire the IOMMU lock around the new call to
context_set_domain_id().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
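
As an aside, a toy model of the table shape this produces (illustrative
sketch only; the real code uses the IOMMU page-table allocator and PTE
encoding helpers, and the per-device fill is done by fill_qpt() in the
hunks below): every slot of every non-leaf level points at one shared
page of the level below, so the single writable leaf page acts as a
scratch sink for whatever a non-quiescent device still reads or writes.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define ENTRIES 512u   /* 4k page, 8-byte entries */
    #define LEVELS  4u     /* toy depth; AMD allows up to IOMMU_MAX_PT_LEVELS */

    /* Returns the root table, or NULL on allocation failure. */
    static uint64_t *build_toy_quarantine_tables(void)
    {
        uint64_t *pages[LEVELS];
        unsigned int level, i;

        for ( level = 0; level < LEVELS; level++ )
        {
            pages[level] = aligned_alloc(4096, 4096);
            if ( !pages[level] )
                return NULL;          /* toy code: leaks on failure */
            memset(pages[level], 0, 4096);
        }

        /* Point every slot of every non-leaf level at the single lower page. */
        for ( level = LEVELS - 1; level > 0; level-- )
            for ( i = 0; i < ENTRIES; i++ )
                pages[level][i] = (uintptr_t)pages[level - 1] | 3; /* pres+rw */

        /* pages[0] stays zeroed: the shared, writable scratch leaf. */
        return pages[LEVELS - 1];
    }

    int main(void)
    {
        return !build_toy_quarantine_tables();
    }
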

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1424,7 +1424,7 @@ int set_identity_p2m_entry(struct domain
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
     int ret;
 
-    if ( !paging_mode_translate(p2m->domain) )
+    if ( !paging_mode_translate(d) )
     {
         if ( !has_iommu_pt(d) )
             return 0;
--- a/xen/include/asm-x86/pci.h
+++ b/xen/include/asm-x86/pci.h
@@ -1,6 +1,8 @@
 #ifndef __X86_PCI_H__
 #define __X86_PCI_H__
 
+#include <xen/mm.h>
+
 #define CF8_BDF(cf8)     (  ((cf8) & 0x00ffff00) >> 8)
 #define CF8_ADDR_LO(cf8) (   (cf8) & 0x000000fc)
 #define CF8_ADDR_HI(cf8) (  ((cf8) & 0x0f000000) >> 16)
@@ -20,7 +22,18 @@ struct arch_pci_dev {
      * them don't race (de)initialization and hence don't strictly need any
      * locking.
      */
+    union {
+        /* Subset of struct arch_iommu's fields, to be used in dom_io. */
+        struct {
+            uint64_t pgd_maddr;
+        } vtd;
+        struct {
+            struct page_info *root_table;
+        } amd;
+    };
     domid_t pseudo_domid;
+    mfn_t leaf_mfn;
+    struct page_list_head pgtables_list;
 };
 
 int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
@@ -51,7 +51,8 @@ void get_iommu_features(struct amd_iommu
 int amd_iommu_init(void);
 int amd_iommu_update_ivrs_mapping_acpi(void);
 
-int amd_iommu_quarantine_init(struct domain *d);
+int amd_iommu_quarantine_init(struct pci_dev *pdev);
+void amd_iommu_quarantine_teardown(struct pci_dev *pdev);
 
 /* mapping functions */
 int __must_check amd_iommu_map_page(struct domain *d, dfn_t dfn,
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -784,64 +784,150 @@ void amd_iommu_share_p2m(struct domain *
     }
 }
 
-int __init amd_iommu_quarantine_init(struct domain *d)
+static int fill_qpt(uint64_t *this, unsigned int level,
+                    struct page_info *pgs[IOMMU_MAX_PT_LEVELS],
+                    struct pci_dev *pdev)
 {
-    struct domain_iommu *hd = dom_iommu(d);
+    unsigned int i;
+    int rc = 0;
+
+    for ( i = 0; !rc && i < PTE_PER_TABLE_SIZE; ++i )
+    {
+        uint32_t *pte = (uint32_t *)&this[i];
+        uint64_t *next;
+
+        if ( !get_field_from_reg_u32(pte[0], IOMMU_PTE_PRESENT_MASK,
+                                     IOMMU_PTE_PRESENT_SHIFT) )
+        {
+            if ( !pgs[level] )
+            {
+                /*
+                 * The pgtable allocator is fine for the leaf page, as well as
+                 * page table pages, and the resulting allocations are always
+                 * zeroed.
+                 */
+                pgs[level] = alloc_amd_iommu_pgtable();
+                if ( !pgs[level] )
+                {
+                    rc = -ENOMEM;
+                    break;
+                }
+
+                page_list_add(pgs[level], &pdev->arch.pgtables_list);
+
+                if ( level )
+                {
+                    next = __map_domain_page(pgs[level]);
+                    rc = fill_qpt(next, level - 1, pgs, pdev);
+                    unmap_domain_page(next);
+                }
+            }
+
+            /*
+             * PDEs are essentially a subset of PTEs, so this function
+             * is fine to use even at the leaf.
+             */
+            set_iommu_pde_present(pte, mfn_x(page_to_mfn(pgs[level])), level,
+                                  true, true);
+        }
+        else if ( level &&
+                  get_field_from_reg_u32(pte[0],
+                                         IOMMU_PDE_NEXT_LEVEL_MASK,
+                                         IOMMU_PDE_NEXT_LEVEL_SHIFT) )
+        {
+            paddr_t addr_hi = get_field_from_reg_u32(pte[1],
+                                                     IOMMU_PTE_ADDR_HIGH_MASK,
+                                                     IOMMU_PTE_ADDR_HIGH_SHIFT);
+            paddr_t addr_lo = get_field_from_reg_u32(pte[0],
+                                                     IOMMU_PTE_ADDR_LOW_MASK,
+                                                     IOMMU_PTE_ADDR_LOW_SHIFT);
+            unsigned long mfn = (addr_hi << (32 - PAGE_SHIFT)) | addr_lo;
+
+            page_list_add(mfn_to_page(_mfn(mfn)), &pdev->arch.pgtables_list);
+            next = map_domain_page(_mfn(mfn));
+            rc = fill_qpt(next, level - 1, pgs, pdev);
+            unmap_domain_page(next);
+        }
+    }
+
+    return rc;
+}
+
+int amd_iommu_quarantine_init(struct pci_dev *pdev)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
     unsigned long end_gfn =
         1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT);
     unsigned int level = amd_iommu_get_paging_mode(end_gfn);
-    uint64_t *table;
+    unsigned int req_id = get_dma_requestor_id(pdev->seg,
+                                               PCI_BDF2(pdev->bus, pdev->devfn));
+    const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
+    int rc;
 
-    if ( hd->arch.root_table )
+    ASSERT(pcidevs_locked());
+    ASSERT(!hd->arch.root_table);
+
+    ASSERT(pdev->arch.pseudo_domid != DOMID_INVALID);
+
+    if ( pdev->arch.amd.root_table )
     {
-        ASSERT_UNREACHABLE();
+        clear_domain_page(pdev->arch.leaf_mfn);
         return 0;
     }
 
-    spin_lock(&hd->arch.mapping_lock);
-
-    hd->arch.root_table = alloc_amd_iommu_pgtable();
-    if ( !hd->arch.root_table )
-        goto out;
-
-    table = __map_domain_page(hd->arch.root_table);
-    while ( level )
+    pdev->arch.amd.root_table = alloc_amd_iommu_pgtable();
+    if ( !pdev->arch.amd.root_table )
+        return -ENOMEM;
+
+    /* Transiently install the root into DomIO, for iommu_identity_mapping(). */
+    hd->arch.root_table = pdev->arch.amd.root_table;
+
+    rc = amd_iommu_reserve_domain_unity_map(dom_io,
+                                            ivrs_mappings[req_id].unity_map,
+                                            0);
+
+    iommu_identity_map_teardown(dom_io);
+    hd->arch.root_table = NULL;
+
+    if ( rc )
+        printk("%04x:%02x:%02x.%u: quarantine unity mapping failed\n",
+               pdev->seg, pdev->bus,
+               PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
+    else
     {
-        struct page_info *pg;
-        unsigned int i;
+        uint64_t *root;
+        struct page_info *pgs[IOMMU_MAX_PT_LEVELS] = {};
 
-        /*
-         * The pgtable allocator is fine for the leaf page, as well as
-         * page table pages, and the resulting allocations are always
-         * zeroed.
-         */
-        pg = alloc_amd_iommu_pgtable();
-        if ( !pg )
-            break;
+        spin_lock(&hd->arch.mapping_lock);
 
-        for ( i = 0; i < PTE_PER_TABLE_SIZE; i++ )
-        {
-            uint32_t *pde = (uint32_t *)&table[i];
+        root = __map_domain_page(pdev->arch.amd.root_table);
+        rc = fill_qpt(root, level - 1, pgs, pdev);
+        unmap_domain_page(root);
 
-            /*
-             * PDEs are essentially a subset of PTEs, so this function
-             * is fine to use even at the leaf.
-             */
-            set_iommu_pde_present(pde, mfn_x(page_to_mfn(pg)), level - 1,
-                                  false, true);
-        }
+        pdev->arch.leaf_mfn = page_to_mfn(pgs[0]);
 
-        unmap_domain_page(table);
-        table = __map_domain_page(pg);
-        level--;
+        spin_unlock(&hd->arch.mapping_lock);
     }
-    unmap_domain_page(table);
 
- out:
-    spin_unlock(&hd->arch.mapping_lock);
+    if ( rc )
+        amd_iommu_quarantine_teardown(pdev);
+
+    return rc;
+}
+
+void amd_iommu_quarantine_teardown(struct pci_dev *pdev)
+{
+    struct page_info *pg;
+
+    ASSERT(pcidevs_locked());
+
+    if ( !pdev->arch.amd.root_table )
+        return;
+
+    while ( (pg = page_list_remove_head(&pdev->arch.pgtables_list)) )
+        free_amd_iommu_pgtable(pg);
 
-    /* Pages leaked in failure case */
-    return level ? -ENOMEM : 0;
+    pdev->arch.amd.root_table = NULL;
 }
 
 /*
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -150,6 +150,8 @@ static int __must_check amd_iommu_setup_
     u8 bus = pdev->bus;
     struct domain_iommu *hd = dom_iommu(domain);
     const struct ivrs_mappings *ivrs_dev;
+    const struct page_info *root_pg;
+    domid_t domid;
 
     BUG_ON(!hd->arch.paging_mode || !iommu->dev_table.buffer);
 
@@ -172,14 +174,25 @@ static int __must_check amd_iommu_setup_
     dte = iommu->dev_table.buffer + (req_id * IOMMU_DEV_TABLE_ENTRY_SIZE);
     ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
 
+    if ( domain != dom_io )
+    {
+        root_pg = hd->arch.root_table;
+        domid = domain->domain_id;
+    }
+    else
+    {
+        root_pg = pdev->arch.amd.root_table;
+        domid = pdev->arch.pseudo_domid;
+    }
+
     spin_lock_irqsave(&iommu->lock, flags);
 
     if ( !is_translation_valid((u32 *)dte) )
     {
         /* bind DTE to domain page-tables */
         rc = amd_iommu_set_root_page_table(
-                 dte, page_to_maddr(hd->arch.root_table),
-                 domain->domain_id, hd->arch.paging_mode, sr_flags);
+                 dte, page_to_maddr(root_pg), domid,
+                 hd->arch.paging_mode, sr_flags);
         if ( rc )
         {
             ASSERT(rc < 0);
@@ -193,8 +206,7 @@ static int __must_check amd_iommu_setup_
 
         amd_iommu_flush_device(iommu, req_id);
     }
-    else if ( amd_iommu_get_root_page_table(dte) !=
-              page_to_maddr(hd->arch.root_table) )
+    else if ( amd_iommu_get_root_page_table(dte) != page_to_maddr(root_pg) )
     {
         /*
          * Strictly speaking if the device is the only one with this requestor
@@ -207,8 +219,8 @@ static int __must_check amd_iommu_setup_
             rc = -EOPNOTSUPP;
         else
             rc = amd_iommu_set_root_page_table(
-                     dte, page_to_maddr(hd->arch.root_table),
-                     domain->domain_id, hd->arch.paging_mode, sr_flags);
+                     dte, page_to_maddr(root_pg), domid,
+                     hd->arch.paging_mode, sr_flags);
         if ( rc < 0 )
         {
             spin_unlock_irqrestore(&iommu->lock, flags);
@@ -227,6 +239,7 @@ static int __must_check amd_iommu_setup_
               * intended anyway.
               */
              !pdev->domain->is_dying &&
+             pdev->domain != dom_io &&
              (any_pdev_behind_iommu(pdev->domain, pdev, iommu) ||
               pdev->phantom_stride) )
             printk(" %04x:%02x:%02x.%u: reassignment may cause %pd data corruption\n",
@@ -247,9 +260,8 @@ static int __must_check amd_iommu_setup_
     AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
                     "root table = %#"PRIx64", "
                     "domain = %d, paging mode = %d\n",
-                    req_id, pdev->type,
-                    page_to_maddr(hd->arch.root_table),
-                    domain->domain_id, hd->arch.paging_mode);
+                    req_id, pdev->type, page_to_maddr(root_pg),
+                    domid, hd->arch.paging_mode);
 
     ASSERT(pcidevs_locked());
 
@@ -296,7 +308,7 @@ int __init amd_iov_detect(void)
 
 int amd_iommu_alloc_root(struct domain_iommu *hd)
 {
-    if ( unlikely(!hd->arch.root_table) )
+    if ( unlikely(!hd->arch.root_table) && hd != dom_iommu(dom_io) )
     {
         hd->arch.root_table = alloc_amd_iommu_pgtable();
         if ( !hd->arch.root_table )
@@ -376,7 +388,10 @@ void amd_iommu_disable_domain_device(str
 
         AMD_IOMMU_DEBUG("Disable: device id = %#x, "
                         "domain = %d, paging mode = %d\n",
-                        req_id,  domain->domain_id,
+                        req_id,
+                        get_field_from_reg_u32(((uint32_t *)dte)[2],
+                                               IOMMU_DEV_TABLE_DOMAIN_ID_MASK,
+                                               IOMMU_DEV_TABLE_DOMAIN_ID_SHIFT),
                         dom_iommu(domain)->arch.paging_mode);
     }
     spin_unlock_irqrestore(&iommu->lock, flags);
@@ -605,6 +620,8 @@ static int amd_iommu_remove_device(u8 de
 
     amd_iommu_disable_domain_device(pdev->domain, iommu, devfn, pdev);
 
+    amd_iommu_quarantine_teardown(pdev);
+
     iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
     pdev->arch.pseudo_domid = DOMID_INVALID;
 
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -497,19 +497,19 @@ int iommu_iotlb_flush_all(struct domain
     return rc;
 }
 
-static int __init iommu_quarantine_init(void)
+int iommu_quarantine_dev_init(device_t *dev)
 {
     const struct domain_iommu *hd = dom_iommu(dom_io);
-    int rc;
-
-    rc = iommu_domain_init(dom_io);
-    if ( rc )
-        return rc;
 
-    if ( !hd->platform_ops->quarantine_init )
+    if ( !iommu_quarantine || !hd->platform_ops->quarantine_init )
         return 0;
 
-    return hd->platform_ops->quarantine_init(dom_io);
+    return hd->platform_ops->quarantine_init(dev);
+}
+
+static int __init iommu_quarantine_init(void)
+{
+    return iommu_domain_init(dom_io);
 }
 
 int __init iommu_setup(void)
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1512,6 +1512,13 @@ static int assign_device(struct domain *
         msixtbl_init(d);
     }
 
+    if ( pdev->domain != dom_io )
+    {
+        rc = iommu_quarantine_dev_init(pci_to_dev(pdev));
+        if ( rc )
+            goto done;
+    }
+
     pdev->fault.count = 0;
 
     if ( (rc = hd->platform_ops->assign_device(d, devfn, pci_to_dev(pdev), flag)) )
@@ -1558,9 +1565,16 @@ int deassign_device(struct domain *d, u1
         return -ENODEV;
 
     /* De-assignment from dom_io should de-quarantine the device */
-    target = ((pdev->quarantine || iommu_quarantine) &&
-              pdev->domain != dom_io) ?
-        dom_io : hardware_domain;
+    if ( (pdev->quarantine || iommu_quarantine) && pdev->domain != dom_io )
+    {
+        ret = iommu_quarantine_dev_init(pci_to_dev(pdev));
+        if ( ret )
+           return ret;
+
+        target = dom_io;
+    }
+    else
+        target = hardware_domain;
 
     while ( pdev->phantom_stride )
     {
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -43,6 +43,12 @@
 #include "vtd.h"
 #include "../ats.h"
 
+#define DEVICE_DOMID(d, pdev) ((d) != dom_io ? (d)->domain_id \
+                                             : (pdev)->arch.pseudo_domid)
+#define DEVICE_PGTABLE(d, pdev) ((d) != dom_io \
+                                 ? dom_iommu(d)->arch.pgd_maddr \
+                                 : (pdev)->arch.vtd.pgd_maddr)
+
 /* Possible unfiltered LAPIC/MSI messages from untrusted sources? */
 bool __read_mostly untrusted_msi;
 
@@ -78,13 +84,18 @@ static int get_iommu_did(domid_t domid,
 
 #define DID_FIELD_WIDTH 16
 #define DID_HIGH_OFFSET 8
+
+/*
+ * This function may have "context" passed as NULL, to merely obtain a DID
+ * for "domid".
+ */
 static int context_set_domain_id(struct context_entry *context,
                                  domid_t domid, struct iommu *iommu)
 {
     unsigned long nr_dom, i;
     int found = 0;
 
-    ASSERT(spin_is_locked(&iommu->lock));
+    ASSERT(pcidevs_locked());
 
     nr_dom = cap_ndoms(iommu->cap);
     i = find_first_bit(iommu->domid_bitmap, nr_dom);
@@ -110,8 +121,13 @@ static int context_set_domain_id(struct
     }
 
     set_bit(i, iommu->domid_bitmap);
-    context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
-    context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
+
+    if ( context )
+    {
+        context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
+        context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
+    }
+
     return 0;
 }
 
@@ -179,8 +195,12 @@ static void check_cleanup_domid_map(stru
                                     const struct pci_dev *exclude,
                                     struct iommu *iommu)
 {
-    bool found = any_pdev_behind_iommu(d, exclude, iommu);
+    bool found;
 
+    if ( d == dom_io )
+        return;
+
+    found = any_pdev_behind_iommu(d, exclude, iommu);
     /*
      * Hidden devices are associated with DomXEN but usable by the hardware
      * domain. Hence they need considering here as well.
@@ -1443,7 +1463,7 @@ int domain_context_mapping_one(
         domid = iommu->domid_map[prev_did];
         if ( domid < DOMID_FIRST_RESERVED )
             prev_dom = rcu_lock_domain_by_id(domid);
-        else if ( domid == DOMID_IO )
+        else if ( pdev ? domid == pdev->arch.pseudo_domid : domid > DOMID_MASK )
             prev_dom = rcu_lock_domain(dom_io);
         if ( !prev_dom )
         {
@@ -1620,15 +1640,12 @@ int domain_context_mapping_one(
     {
         if ( !prev_dom )
             domain_context_unmap_one(domain, iommu, bus, devfn,
-                                     domain->domain_id);
+                                     DEVICE_DOMID(domain, pdev));
         else if ( prev_dom != domain ) /* Avoid infinite recursion. */
-        {
-            hd = dom_iommu(prev_dom);
             domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
-                                       domain->domain_id,
-                                       hd->arch.pgd_maddr,
+                                       DEVICE_DOMID(prev_dom, pdev),
+                                       DEVICE_PGTABLE(prev_dom, pdev),
                                        mode & MAP_WITH_RMRR);
-        }
     }
 
     if ( prev_dom )
@@ -1645,7 +1662,7 @@ static int domain_context_mapping(struct
 {
     struct acpi_drhd_unit *drhd;
     const struct acpi_rmrr_unit *rmrr;
-    paddr_t pgd_maddr = dom_iommu(domain)->arch.pgd_maddr;
+    paddr_t pgd_maddr = DEVICE_PGTABLE(domain, pdev);
     domid_t orig_domid = pdev->arch.pseudo_domid;
     int ret = 0;
     unsigned int i, mode = 0;
@@ -1668,7 +1685,7 @@ static int domain_context_mapping(struct
         break;
     }
 
-    if ( domain != pdev->domain )
+    if ( domain != pdev->domain && pdev->domain != dom_io )
     {
         if ( pdev->domain->is_dying )
             mode |= MAP_OWNER_DYING;
@@ -1709,8 +1726,8 @@ static int domain_context_mapping(struct
             printk(VTDPREFIX "d%d:PCIe: map %04x:%02x:%02x.%u\n",
                    domain->domain_id, seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
-        ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, domain->domain_id, pgd_maddr,
+        ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, pdev,
+                                         DEVICE_DOMID(domain, pdev), pgd_maddr,
                                          mode);
         if ( ret > 0 )
             ret = 0;
@@ -1734,8 +1751,8 @@ static int domain_context_mapping(struct
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, domain->domain_id, pgd_maddr,
-                                         mode);
+                                         pdev, DEVICE_DOMID(domain, pdev),
+                                         pgd_maddr, mode);
         if ( ret < 0 )
             break;
         prev_present = ret;
@@ -1761,8 +1778,8 @@ static int domain_context_mapping(struct
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                             NULL, domain->domain_id, pgd_maddr,
-                                             mode);
+                                             NULL, DEVICE_DOMID(domain, pdev),
+                                             pgd_maddr, mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1777,8 +1794,8 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL, domain->domain_id, pgd_maddr,
-                                             mode);
+                                             NULL, DEVICE_DOMID(domain, pdev),
+                                             pgd_maddr, mode);
 
         if ( ret )
         {
@@ -1914,7 +1931,7 @@ static const struct acpi_drhd_unit *doma
                    domain->domain_id, seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
         ret = domain_context_unmap_one(domain, iommu, bus, devfn,
-                                       domain->domain_id);
+                                       DEVICE_DOMID(domain, pdev));
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             disable_ats_device(pdev);
 
@@ -1925,7 +1942,7 @@ static const struct acpi_drhd_unit *doma
             printk(VTDPREFIX "d%d:PCI: unmap %04x:%02x:%02x.%u\n",
                    domain->domain_id, seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
         ret = domain_context_unmap_one(domain, iommu, bus, devfn,
-                                       domain->domain_id);
+                                       DEVICE_DOMID(domain, pdev));
         if ( ret )
             break;
 
@@ -1934,18 +1951,12 @@ static const struct acpi_drhd_unit *doma
         if ( find_upstream_bridge(seg, &tmp_bus, &tmp_devfn, &secbus) < 1 )
             break;
 
+        ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                       DEVICE_DOMID(domain, pdev));
         /* PCIe to PCI/PCIx bridge */
-        if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
-        {
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
-                                           domain->domain_id);
-            if ( !ret )
-                ret = domain_context_unmap_one(domain, iommu, secbus, 0,
-                                               domain->domain_id);
-        }
-        else /* Legacy PCI bridge */
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
-                                           domain->domain_id);
+        if ( !ret && pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
+            ret = domain_context_unmap_one(domain, iommu, secbus, 0,
+                                           DEVICE_DOMID(domain, pdev));
 
         break;
 
@@ -1992,6 +2003,25 @@ static void iommu_domain_teardown(struct
     spin_unlock(&hd->arch.mapping_lock);
 }
 
+static void quarantine_teardown(struct pci_dev *pdev,
+                                const struct acpi_drhd_unit *drhd)
+{
+    struct page_info *pg;
+
+    ASSERT(pcidevs_locked());
+
+    if ( !pdev->arch.vtd.pgd_maddr )
+        return;
+
+    while ( (pg = page_list_remove_head(&pdev->arch.pgtables_list)) )
+        free_domheap_page(pg);
+
+    pdev->arch.vtd.pgd_maddr = 0;
+
+    if ( drhd )
+        cleanup_domid_map(pdev->arch.pseudo_domid, drhd->iommu);
+}
+
 static int __must_check intel_iommu_map_page(struct domain *d, dfn_t dfn,
                                              mfn_t mfn, unsigned int flags,
                                              unsigned int *flush_flags)
@@ -2214,6 +2244,8 @@ static int intel_iommu_remove_device(u8
                                rmrr->end_address, 0);
     }
 
+    quarantine_teardown(pdev, drhd);
+
     if ( drhd )
     {
         iommu_free_domid(pdev->arch.pseudo_domid,
@@ -2852,60 +2884,139 @@ static void vtd_dump_p2m_table(struct do
     vtd_dump_p2m_table_level(hd->arch.pgd_maddr, agaw_to_level(hd->arch.agaw), 0, 0);
 }
 
-static int __init intel_iommu_quarantine_init(struct domain *d)
+static int fill_qpt(struct dma_pte *this, unsigned int level,
+                    paddr_t maddrs[6], struct pci_dev *pdev)
 {
-    struct domain_iommu *hd = dom_iommu(d);
-    struct dma_pte *parent;
+    unsigned int i;
+    int rc = 0;
+
+    for ( i = 0; !rc && i < PTE_NUM; ++i )
+    {
+        struct dma_pte *pte = &this[i], *next;
+
+        if ( !dma_pte_present(*pte) )
+        {
+            if ( !maddrs[level] )
+            {
+                /*
+                 * The pgtable allocator is fine for the leaf page, as well as
+                 * page table pages, and the resulting allocations are always
+                 * zeroed.
+                 */
+                maddrs[level] = alloc_pgtable_maddr(NULL, 1);
+                if ( !maddrs[level] )
+                {
+                    rc = -ENOMEM;
+                    break;
+                }
+
+                page_list_add(maddr_to_page(maddrs[level]),
+                              &pdev->arch.pgtables_list);
+
+                if ( level )
+                {
+                    next = map_vtd_domain_page(maddrs[level]);
+                    rc = fill_qpt(next, level - 1, maddrs, pdev);
+                    unmap_vtd_domain_page(next);
+                }
+            }
+
+            dma_set_pte_addr(*pte, maddrs[level]);
+            dma_set_pte_readable(*pte);
+            dma_set_pte_writable(*pte);
+        }
+        else if ( level && !dma_pte_superpage(*pte) )
+        {
+            page_list_add(maddr_to_page(dma_pte_addr(*pte)),
+                          &pdev->arch.pgtables_list);
+            next = map_vtd_domain_page(dma_pte_addr(*pte));
+            rc = fill_qpt(next, level - 1, maddrs, pdev);
+            unmap_vtd_domain_page(next);
+        }
+    }
+
+    return rc;
+}
+
+static int intel_iommu_quarantine_init(struct pci_dev *pdev)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
+    paddr_t maddr;
     unsigned int agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
     unsigned int level = agaw_to_level(agaw);
+    const struct acpi_drhd_unit *drhd;
+    const struct acpi_rmrr_unit *rmrr;
+    unsigned int i, bdf;
+    bool rmrr_found = false;
+    int rc;
 
-    if ( hd->arch.pgd_maddr )
+    ASSERT(pcidevs_locked());
+    ASSERT(!hd->arch.pgd_maddr);
+
+    if ( pdev->arch.vtd.pgd_maddr )
     {
-        ASSERT_UNREACHABLE();
+        clear_domain_page(pdev->arch.leaf_mfn);
         return 0;
     }
 
-    spin_lock(&hd->arch.mapping_lock);
+    drhd = acpi_find_matched_drhd_unit(pdev);
+    if ( !drhd )
+        return -ENODEV;
 
-    hd->arch.pgd_maddr = alloc_pgtable_maddr(NULL, 1);
-    if ( !hd->arch.pgd_maddr )
-        goto out;
+    maddr = alloc_pgtable_maddr(NULL, 1);
+    if ( !maddr )
+        return -ENOMEM;
 
-    parent = map_vtd_domain_page(hd->arch.pgd_maddr);
-    while ( level )
-    {
-        uint64_t maddr;
-        unsigned int offset;
+    rc = context_set_domain_id(NULL, pdev->arch.pseudo_domid, drhd->iommu);
 
-        /*
-         * The pgtable allocator is fine for the leaf page, as well as
-         * page table pages, and the resulting allocations are always
-         * zeroed.
-         */
-        maddr = alloc_pgtable_maddr(NULL, 1);
-        if ( !maddr )
+    /* Transiently install the root into DomIO, for iommu_identity_mapping(). */
+    hd->arch.pgd_maddr = maddr;
+
+    for_each_rmrr_device ( rmrr, bdf, i )
+    {
+        if ( rc )
             break;
 
-        for ( offset = 0; offset < PTE_NUM; offset++ )
+        if ( rmrr->segment == pdev->seg &&
+             bdf == PCI_BDF2(pdev->bus, pdev->devfn) )
         {
-            struct dma_pte *pte = &parent[offset];
+            rmrr_found = true;
 
-            dma_set_pte_addr(*pte, maddr);
-            dma_set_pte_readable(*pte);
+            rc = iommu_identity_mapping(dom_io, p2m_access_rw,
+                                        rmrr->base_address, rmrr->end_address,
+                                        0);
+            if ( rc )
+                printk(XENLOG_ERR VTDPREFIX
+                       "%04x:%02x:%02x.%u: RMRR quarantine mapping failed\n",
+                       pdev->seg, pdev->bus,
+                       PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
         }
-        iommu_sync_cache(parent, PAGE_SIZE);
+    }
 
-        unmap_vtd_domain_page(parent);
-        parent = map_vtd_domain_page(maddr);
-        level--;
+    iommu_identity_map_teardown(dom_io);
+    hd->arch.pgd_maddr = 0;
+    pdev->arch.vtd.pgd_maddr = maddr;
+
+    if ( !rc )
+    {
+        struct dma_pte *root;
+        paddr_t maddrs[6] = {};
+
+        spin_lock(&hd->arch.mapping_lock);
+
+        root = map_vtd_domain_page(maddr);
+        rc = fill_qpt(root, level - 1, maddrs, pdev);
+        unmap_vtd_domain_page(root);
+
+        pdev->arch.leaf_mfn = maddr_to_mfn(maddrs[0]);
+
+        spin_unlock(&hd->arch.mapping_lock);
     }
-    unmap_vtd_domain_page(parent);
 
- out:
-    spin_unlock(&hd->arch.mapping_lock);
+    if ( rc )
+        quarantine_teardown(pdev, drhd);
 
-    /* Pages leaked in failure case */
-    return level ? -ENOMEM : 0;
+    return rc;
 }
 
 const struct iommu_ops __initconstrel intel_iommu_ops = {
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -535,7 +535,7 @@ struct iommu {
     u32 nr_pt_levels;
     u64	cap;
     u64	ecap;
-    spinlock_t lock; /* protect context, domain ids */
+    spinlock_t lock; /* protect context */
     spinlock_t register_lock; /* protect iommu register handling */
     u64 root_maddr; /* root entry machine address */
     struct msi_desc msi;
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -189,7 +189,7 @@ typedef int iommu_grdm_t(xen_pfn_t start
 struct iommu_ops {
     int (*init)(struct domain *d);
     void (*hwdom_init)(struct domain *d);
-    int (*quarantine_init)(struct domain *d);
+    int (*quarantine_init)(device_t *dev);
     int (*add_device)(u8 devfn, device_t *dev);
     int (*enable_device)(device_t *dev);
     int (*remove_device)(u8 devfn, device_t *dev);
@@ -277,6 +277,7 @@ int __must_check iommu_suspend(void);
 void iommu_resume(void);
 void iommu_crash_shutdown(void);
 int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
+int iommu_quarantine_dev_init(device_t *dev);
 
 void iommu_share_p2m_table(struct domain *d);
 
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: split domid map cleanup check into a function

This logic will need to be invoked from elsewhere.

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -153,6 +153,51 @@ static void cleanup_domid_map(struct dom
     }
 }
 
+static bool any_pdev_behind_iommu(const struct domain *d,
+                                  const struct pci_dev *exclude,
+                                  const struct vtd_iommu *iommu)
+{
+    const struct pci_dev *pdev;
+
+    for_each_pdev ( d, pdev )
+    {
+        const struct acpi_drhd_unit *drhd;
+
+        if ( pdev == exclude )
+            continue;
+
+        drhd = acpi_find_matched_drhd_unit(pdev);
+        if ( drhd && drhd->iommu == iommu )
+            return true;
+    }
+
+    return false;
+}
+
+/*
+ * If no other devices under the same iommu owned by this domain,
+ * clear iommu in iommu_bitmap and clear domain_id in domid_bitmap.
+ */
+static void check_cleanup_domid_map(struct domain *d,
+                                    const struct pci_dev *exclude,
+                                    struct vtd_iommu *iommu)
+{
+    bool found = any_pdev_behind_iommu(d, exclude, iommu);
+
+    /*
+     * Hidden devices are associated with DomXEN but usable by the hardware
+     * domain. Hence they need considering here as well.
+     */
+    if ( !found && is_hardware_domain(d) )
+        found = any_pdev_behind_iommu(dom_xen, exclude, iommu);
+
+    if ( !found )
+    {
+        clear_bit(iommu->index, &dom_iommu(d)->arch.iommu_bitmap);
+        cleanup_domid_map(d, iommu);
+    }
+}
+
 static int iommus_incoherent;
 
 static void sync_cache(const void *addr, unsigned int size)
@@ -1685,7 +1730,6 @@ static int domain_context_unmap(struct d
     struct vtd_iommu *iommu;
     int ret = 0;
     u8 seg = pdev->seg, bus = pdev->bus, tmp_bus, tmp_devfn, secbus;
-    int found = 0;
 
     drhd = acpi_find_matched_drhd_unit(pdev);
     if ( !drhd )
@@ -1769,28 +1813,8 @@ static int domain_context_unmap(struct d
     if ( ret )
         goto out;
 
-    /*
-     * if no other devices under the same iommu owned by this domain,
-     * clear iommu in iommu_bitmap and clear domain_id in domid_bitmp
-     */
-    for_each_pdev ( domain, pdev )
-    {
-        if ( pdev->seg == seg && pdev->bus == bus && pdev->devfn == devfn )
-            continue;
-
-        drhd = acpi_find_matched_drhd_unit(pdev);
-        if ( drhd && drhd->iommu == iommu )
-        {
-            found = 1;
-            break;
-        }
-    }
-
-    if ( found == 0 )
-    {
-        clear_bit(iommu->index, &dom_iommu(domain)->arch.iommu_bitmap);
-        cleanup_domid_map(domain, iommu);
-    }
+    if ( !ret )
+        check_cleanup_domid_map(domain, pdev, iommu);
 
 out:
     return ret;
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: fix (de)assign ordering when RMRRs are in use

In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully updated.

Also adjust a related log message.

This is CVE-2022-26358 / part of XSA-400.

Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
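
For orientation, the resulting flow in intel_iommu_assign_device(),
with the dom_io short-cut, the hardware-domain special case, locking,
and the error messages stripped, boils down to the sketch below.  Note
that rmrr_matches() is merely shorthand for the open-coded
segment/bus/devfn comparison and is not an actual helper in the tree.

    /* 1) Establish the RMRR identity mappings in the target first. */
    for_each_rmrr_device ( rmrr, bdf, i )
        if ( !ret && rmrr_matches(rmrr, bdf, pdev) )
            ret = iommu_identity_mapping(d, p2m_access_rw,
                                         rmrr->base_address,
                                         rmrr->end_address, flag);

    /* 2) Only then update the context entry (ownership transfer). */
    if ( !ret )
        ret = reassign_device_ownership(s, d, devfn, pdev);

    /* 3) On failure, undo the mappings established in step 1. */
    if ( ret )
        for_each_rmrr_device ( rmrr, bdf, i )
            if ( rmrr_matches(rmrr, bdf, pdev) )
                iommu_identity_mapping(d, p2m_access_x,
                                       rmrr->base_address,
                                       rmrr->end_address, 0);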

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2392,6 +2392,10 @@ static int reassign_device_ownership(
 {
     int ret;
 
+    ret = domain_context_unmap(source, devfn, pdev);
+    if ( ret )
+        return ret;
+
     /*
      * Devices assigned to untrusted domains (here assumed to be any domU)
      * can attempt to send arbitrary LAPIC/MSI messages. We are unprotected
@@ -2428,10 +2432,6 @@ static int reassign_device_ownership(
             }
     }
 
-    ret = domain_context_unmap(source, devfn, pdev);
-    if ( ret )
-        return ret;
-
     if ( devfn == pdev->devfn && pdev->domain != dom_io )
     {
         list_move(&pdev->domain_list, &dom_io->pdev_list);
@@ -2508,9 +2508,8 @@ static int intel_iommu_assign_device(
         }
     }
 
-    ret = reassign_device_ownership(s, d, devfn, pdev);
-    if ( ret || d == dom_io )
-        return ret;
+    if ( d == dom_io )
+        return reassign_device_ownership(s, d, devfn, pdev);
 
     /* Setup rmrr identity mapping */
     for_each_rmrr_device( rmrr, bdf, i )
@@ -2523,20 +2522,37 @@ static int intel_iommu_assign_device(
                                          rmrr->end_address, flag);
             if ( ret )
             {
-                int rc;
-
-                rc = reassign_device_ownership(d, s, devfn, pdev);
                 printk(XENLOG_G_ERR VTDPREFIX
-                       " cannot map reserved region (%"PRIx64",%"PRIx64"] for Dom%d (%d)\n",
-                       rmrr->base_address, rmrr->end_address,
-                       d->domain_id, ret);
-                if ( rc )
-                {
-                    printk(XENLOG_ERR VTDPREFIX
-                           " failed to reclaim %04x:%02x:%02x.%u from %pd (%d)\n",
-                           seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), d, rc);
-                    domain_crash(d);
-                }
+                       "%pd: cannot map reserved region [%"PRIx64",%"PRIx64"]: %d\n",
+                       d, rmrr->base_address, rmrr->end_address, ret);
+                break;
+            }
+        }
+    }
+
+    if ( !ret )
+        ret = reassign_device_ownership(s, d, devfn, pdev);
+
+    /* See reassign_device_ownership() for the hwdom aspect. */
+    if ( !ret || is_hardware_domain(d) )
+        return ret;
+
+    for_each_rmrr_device( rmrr, bdf, i )
+    {
+        if ( rmrr->segment == seg &&
+             PCI_BUS(bdf) == bus &&
+             PCI_DEVFN2(bdf) == devfn )
+        {
+            int rc = iommu_identity_mapping(d, p2m_access_x,
+                                            rmrr->base_address,
+                                            rmrr->end_address, 0);
+
+            if ( rc && rc != -ENOENT )
+            {
+                printk(XENLOG_ERR VTDPREFIX
+                       "%pd: cannot unmap reserved region [%"PRIx64",%"PRIx64"]: %d\n",
+                       d, rmrr->base_address, rmrr->end_address, rc);
+                domain_crash(d);
                 break;
             }
         }
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: fix add/remove ordering when RMRRs are in use

In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully cleared.

Also switch to %pd in related log messages.

Fixes: fa88cfadf918 ("vt-d: Map RMRR in intel_iommu_add_device() if the device has RMRR")
Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
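
The remove path becomes the mirror image of the assignment change in
the previous patch: the context entry is torn down first, and the RMRR
mappings are removed only afterwards.  A condensed sketch of the
reordered intel_iommu_remove_device(), again using a hypothetical
rmrr_matches() shorthand for the open-coded segment/BDF match:

    ret = domain_context_unmap(pdev->domain, devfn, pdev);
    if ( ret )
        return ret;                  /* device may still be wired up */

    for_each_rmrr_device ( rmrr, bdf, i )
        if ( rmrr_matches(rmrr, bdf, pdev) )
            iommu_identity_mapping(pdev->domain, p2m_access_x,
                                   rmrr->base_address,
                                   rmrr->end_address, 0);

    return 0;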

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1993,14 +1993,6 @@ static int intel_iommu_add_device(u8 dev
     if ( !pdev->domain )
         return -EINVAL;
 
-    ret = domain_context_mapping(pdev->domain, devfn, pdev);
-    if ( ret )
-    {
-        dprintk(XENLOG_ERR VTDPREFIX, "d%d: context mapping failed\n",
-                pdev->domain->domain_id);
-        return ret;
-    }
-
     for_each_rmrr_device ( rmrr, bdf, i )
     {
         if ( rmrr->segment == pdev->seg &&
@@ -2017,12 +2009,17 @@ static int intel_iommu_add_device(u8 dev
                                          rmrr->base_address, rmrr->end_address,
                                          0);
             if ( ret )
-                dprintk(XENLOG_ERR VTDPREFIX, "d%d: RMRR mapping failed\n",
-                        pdev->domain->domain_id);
+                dprintk(XENLOG_ERR VTDPREFIX, "%pd: RMRR mapping failed\n",
+                        pdev->domain);
         }
     }
 
-    return 0;
+    ret = domain_context_mapping(pdev->domain, devfn, pdev);
+    if ( ret )
+        dprintk(XENLOG_ERR VTDPREFIX, "%pd: context mapping failed\n",
+                pdev->domain);
+
+    return ret;
 }
 
 static int intel_iommu_enable_device(struct pci_dev *pdev)
@@ -2044,11 +2041,15 @@ static int intel_iommu_remove_device(u8
 {
     struct acpi_rmrr_unit *rmrr;
     u16 bdf;
-    int i;
+    int ret, i;
 
     if ( !pdev->domain )
         return -EINVAL;
 
+    ret = domain_context_unmap(pdev->domain, devfn, pdev);
+    if ( ret )
+        return ret;
+
     for_each_rmrr_device ( rmrr, bdf, i )
     {
         if ( rmrr->segment != pdev->seg ||
@@ -2064,7 +2065,7 @@ static int intel_iommu_remove_device(u8
                                rmrr->end_address, 0);
     }
 
-    return domain_context_unmap(pdev->domain, devfn, pdev);
+    return 0;
 }
 
 static int __hwdom_init setup_hwdom_device(u8 devfn, struct pci_dev *pdev)
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: drop ownership checking from domain_context_mapping_one()

Despite putting in quite a bit of effort it was not possible to
establish why exactly this code exists (beyond possibly sanity
checking). Instead of a subsequent change further complicating this
logic, simply get rid of it.

Take the opportunity and move the respective unmap_vtd_domain_page() out
of the locked region.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -114,28 +114,6 @@ static int context_set_domain_id(struct
     return 0;
 }
 
-static int context_get_domain_id(struct context_entry *context,
-                                 struct vtd_iommu *iommu)
-{
-    unsigned long dom_index, nr_dom;
-    int domid = -1;
-
-    if (iommu && context)
-    {
-        nr_dom = cap_ndoms(iommu->cap);
-
-        dom_index = context_domain_id(*context);
-
-        if ( dom_index < nr_dom && iommu->domid_map )
-            domid = iommu->domid_map[dom_index];
-        else
-            dprintk(XENLOG_DEBUG VTDPREFIX,
-                    "dom_index %lu exceeds nr_dom %lu or iommu has no domid_map\n",
-                    dom_index, nr_dom);
-    }
-    return domid;
-}
-
 static void cleanup_domid_map(struct domain *domain, struct vtd_iommu *iommu)
 {
     int iommu_domid = domain_iommu_domid(domain, iommu);
@@ -1392,49 +1370,9 @@ int domain_context_mapping_one(
 
     if ( context_present(*context) )
     {
-        int res = 0;
-
-        /* Try to get domain ownership from device structure.  If that's
-         * not available, try to read it from the context itself. */
-        if ( pdev )
-        {
-            if ( pdev->domain != domain )
-            {
-                printk(XENLOG_G_INFO VTDPREFIX
-                       "d%d: %04x:%02x:%02x.%u owned by d%d!",
-                       domain->domain_id,
-                       seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
-                       pdev->domain ? pdev->domain->domain_id : -1);
-                res = -EINVAL;
-            }
-        }
-        else
-        {
-            int cdomain;
-            cdomain = context_get_domain_id(context, iommu);
-            
-            if ( cdomain < 0 )
-            {
-                printk(XENLOG_G_WARNING VTDPREFIX
-                       "d%d: %04x:%02x:%02x.%u mapped, but can't find owner!\n",
-                       domain->domain_id,
-                       seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
-                res = -EINVAL;
-            }
-            else if ( cdomain != domain->domain_id )
-            {
-                printk(XENLOG_G_INFO VTDPREFIX
-                       "d%d: %04x:%02x:%02x.%u already mapped to d%d!",
-                       domain->domain_id,
-                       seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
-                       cdomain);
-                res = -EINVAL;
-            }
-        }
-
-        unmap_vtd_domain_page(context_entries);
         spin_unlock(&iommu->lock);
-        return res;
+        unmap_vtd_domain_page(context_entries);
+        return 0;
     }
 
     if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: re-assign devices directly

Since it is unspecified how/when the specified memory regions may get
accessed, devices with RMRRs must not be left disconnected from their
respective mappings (as long as it's not certain that the device has
been fully quiesced). Hence rather than unmapping the old context and
then mapping the new one, re-assignment needs to be done in a single
step.

This is CVE-2022-26359 / part of XSA-400.

Reported-by: Roger Pau Monné <roger.pau@citrix.com>

Similarly quarantining scratch-page mode relies on page tables to be
continuously wired up.

To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any RMRRs. The
main difference is when it comes to updating context entries, which need
to be atomic when there are RMRRs. Yet atomicity can only be achieved
with CMPXCHG16B, availability of which we can't take for granted.

The seemingly complicated choice of non-negative return values for
domain_context_mapping_one() is to limit code churn: this way, callers
passing NULL for pdev don't need to be touched.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
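
For readers skimming the diff, the context entry update strategy added
to domain_context_mapping_one() can be condensed as follows (locking,
cache syncing, and the diagnostic messages are omitted; the identifiers
match those used in the hunks below):

    if ( cpu_has_cx16 )
    {
        /* Atomic 128-bit replacement; hardware never writes the entry. */
        __uint128_t res = cmpxchg16b(context, &old, &lctxt.full);

        if ( res != old )
            rc = -EILSEQ;            /* unexpected concurrent change */
    }
    else if ( !prev_dom || !(mode & MAP_WITH_RMRR) )
    {
        /* No RMRR constraint: transiently clear the present bit. */
        context_clear_present(*context);
        iommu_sync_cache(context, sizeof(*context));

        write_atomic(&context->hi, lctxt.hi);
        write_atomic(&context->lo, lctxt.lo); /* present bit set last */
    }
    else
    {
        /* Best effort: entry stays present, DID gets updated last. */
        write_atomic(&context->lo, lctxt.lo);
        write_atomic(&context->hi, lctxt.hi); /* DID lives in ->hi */
    }

In this patch, the MAP_OWNER_DYING and MAP_SINGLE_DEVICE flags only
control whether the data corruption warning is emitted on the best
effort path.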

--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -85,7 +85,8 @@ void free_pgtable_maddr(u64 maddr);
 void *map_vtd_domain_page(u64 maddr);
 void unmap_vtd_domain_page(void *va);
 int domain_context_mapping_one(struct domain *domain, struct vtd_iommu *iommu,
-                               u8 bus, u8 devfn, const struct pci_dev *);
+                               uint8_t bus, uint8_t devfn,
+                               const struct pci_dev *pdev, unsigned int mode);
 int domain_context_unmap_one(struct domain *domain, struct vtd_iommu *iommu,
                              u8 bus, u8 devfn);
 int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
@@ -105,8 +106,8 @@ int is_igd_vt_enabled_quirk(void);
 void platform_quirks_init(void);
 void vtd_ops_preamble_quirk(struct vtd_iommu *iommu);
 void vtd_ops_postamble_quirk(struct vtd_iommu *iommu);
-int __must_check me_wifi_quirk(struct domain *domain,
-                               u8 bus, u8 devfn, int map);
+int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus,
+                               uint8_t devfn, unsigned int mode);
 void pci_vtd_quirk(const struct pci_dev *);
 void quirk_iommu_caps(struct vtd_iommu *iommu);
 
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -110,6 +110,7 @@ static int context_set_domain_id(struct
     }
 
     set_bit(i, iommu->domid_bitmap);
+    context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
     context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
     return 0;
 }
@@ -1350,15 +1351,27 @@ static void __hwdom_init intel_iommu_hwd
     }
 }
 
+/*
+ * This function returns
+ * - a negative errno value upon error,
+ * - zero upon success when previously the entry was non-present, or this isn't
+ *   the "main" request for a device (pdev == NULL), or for no-op quarantining
+ *   assignments,
+ * - positive (one) upon success when previously the entry was present and this
+ *   is the "main" request for a device (pdev != NULL).
+ */
 int domain_context_mapping_one(
     struct domain *domain,
     struct vtd_iommu *iommu,
-    u8 bus, u8 devfn, const struct pci_dev *pdev)
+    uint8_t bus, uint8_t devfn, const struct pci_dev *pdev,
+    unsigned int mode)
 {
     struct domain_iommu *hd = dom_iommu(domain);
-    struct context_entry *context, *context_entries;
+    struct context_entry *context, *context_entries, lctxt;
+    __uint128_t old;
     u64 maddr, pgd_maddr;
-    u16 seg = iommu->drhd->segment;
+    uint16_t seg = iommu->drhd->segment, prev_did = 0;
+    struct domain *prev_dom = NULL;
     int agaw, rc, ret;
     bool_t flush_dev_iotlb;
 
@@ -1367,17 +1380,32 @@ int domain_context_mapping_one(
     maddr = bus_to_context_maddr(iommu, bus);
     context_entries = (struct context_entry *)map_vtd_domain_page(maddr);
     context = &context_entries[devfn];
+    old = (lctxt = *context).full;
 
-    if ( context_present(*context) )
+    if ( context_present(lctxt) )
     {
-        spin_unlock(&iommu->lock);
-        unmap_vtd_domain_page(context_entries);
-        return 0;
+        domid_t domid;
+
+        prev_did = context_domain_id(lctxt);
+        domid = iommu->domid_map[prev_did];
+        if ( domid < DOMID_FIRST_RESERVED )
+            prev_dom = rcu_lock_domain_by_id(domid);
+        else if ( domid == DOMID_IO )
+            prev_dom = rcu_lock_domain(dom_io);
+        if ( !prev_dom )
+        {
+            spin_unlock(&iommu->lock);
+            unmap_vtd_domain_page(context_entries);
+            dprintk(XENLOG_DEBUG VTDPREFIX,
+                    "no domain for did %u (nr_dom %u)\n",
+                    prev_did, cap_ndoms(iommu->cap));
+            return -ESRCH;
+        }
     }
 
     if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
     {
-        context_set_translation_type(*context, CONTEXT_TT_PASS_THRU);
+        context_set_translation_type(lctxt, CONTEXT_TT_PASS_THRU);
         agaw = level_to_agaw(iommu->nr_pt_levels);
     }
     else
@@ -1394,6 +1422,8 @@ int domain_context_mapping_one(
                 spin_unlock(&hd->arch.mapping_lock);
                 spin_unlock(&iommu->lock);
                 unmap_vtd_domain_page(context_entries);
+                if ( prev_dom )
+                    rcu_unlock_domain(prev_dom);
                 return -ENOMEM;
             }
         }
@@ -1411,33 +1441,102 @@ int domain_context_mapping_one(
                 goto nomem;
         }
 
-        context_set_address_root(*context, pgd_maddr);
+        context_set_address_root(lctxt, pgd_maddr);
         if ( ats_enabled && ecap_dev_iotlb(iommu->ecap) )
-            context_set_translation_type(*context, CONTEXT_TT_DEV_IOTLB);
+            context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB);
         else
-            context_set_translation_type(*context, CONTEXT_TT_MULTI_LEVEL);
+            context_set_translation_type(lctxt, CONTEXT_TT_MULTI_LEVEL);
 
         spin_unlock(&hd->arch.mapping_lock);
     }
 
-    if ( context_set_domain_id(context, domain, iommu) )
+    rc = context_set_domain_id(&lctxt, domain, iommu);
+    if ( rc )
     {
+    unlock:
         spin_unlock(&iommu->lock);
         unmap_vtd_domain_page(context_entries);
-        return -EFAULT;
+        if ( prev_dom )
+            rcu_unlock_domain(prev_dom);
+        return rc;
+    }
+
+    if ( !prev_dom )
+    {
+        context_set_address_width(lctxt, agaw);
+        context_set_fault_enable(lctxt);
+        context_set_present(lctxt);
+    }
+    else if ( prev_dom == domain )
+    {
+        ASSERT(lctxt.full == context->full);
+        rc = !!pdev;
+        goto unlock;
+    }
+    else
+    {
+        ASSERT(context_address_width(lctxt) == agaw);
+        ASSERT(!context_fault_disable(lctxt));
+    }
+
+    if ( cpu_has_cx16 )
+    {
+        __uint128_t res = cmpxchg16b(context, &old, &lctxt.full);
+
+        /*
+         * Hardware does not update the context entry behind our backs,
+         * so the return value should match "old".
+         */
+        if ( res != old )
+        {
+            if ( pdev )
+                check_cleanup_domid_map(domain, pdev, iommu);
+            printk(XENLOG_ERR
+                   "%04x:%02x:%02x.%u: unexpected context entry %016lx_%016lx (expected %016lx_%016lx)\n",
+                   pdev->seg, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                   (uint64_t)(res >> 64), (uint64_t)res,
+                   (uint64_t)(old >> 64), (uint64_t)old);
+            rc = -EILSEQ;
+            goto unlock;
+        }
+    }
+    else if ( !prev_dom || !(mode & MAP_WITH_RMRR) )
+    {
+        context_clear_present(*context);
+        iommu_sync_cache(context, sizeof(*context));
+
+        write_atomic(&context->hi, lctxt.hi);
+        /* No barrier should be needed between these two. */
+        write_atomic(&context->lo, lctxt.lo);
+    }
+    else /* Best effort, updating DID last. */
+    {
+         /*
+          * By non-atomically updating the context entry's DID field last,
+          * during a short window in time TLB entries with the old domain ID
+          * but the new page tables may be inserted.  This could affect I/O
+          * of other devices using this same (old) domain ID.  Such updating
+          * therefore is not a problem if this was the only device associated
+          * with the old domain ID.  Diverting I/O of any of a dying domain's
+          * devices to the quarantine page tables is intended anyway.
+          */
+        if ( !(mode & (MAP_OWNER_DYING | MAP_SINGLE_DEVICE)) )
+            printk(XENLOG_WARNING VTDPREFIX
+                   " %04x:%02x:%02x.%u: reassignment may cause %pd data corruption\n",
+                   seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), prev_dom);
+
+        write_atomic(&context->lo, lctxt.lo);
+        /* No barrier should be needed between these two. */
+        write_atomic(&context->hi, lctxt.hi);
     }
 
-    context_set_address_width(*context, agaw);
-    context_set_fault_enable(*context);
-    context_set_present(*context);
     iommu_sync_cache(context, sizeof(struct context_entry));
     spin_unlock(&iommu->lock);
 
-    /* Context entry was previously non-present (with domid 0). */
-    rc = iommu_flush_context_device(iommu, 0, PCI_BDF2(bus, devfn),
-                                    DMA_CCMD_MASK_NOBIT, 1);
+    rc = iommu_flush_context_device(iommu, prev_did, PCI_BDF2(bus, devfn),
+                                    DMA_CCMD_MASK_NOBIT, !prev_dom);
     flush_dev_iotlb = !!find_ats_dev_drhd(iommu);
-    ret = iommu_flush_iotlb_dsi(iommu, 0, 1, flush_dev_iotlb);
+    ret = iommu_flush_iotlb_dsi(iommu, prev_did, !prev_dom, flush_dev_iotlb);
 
     /*
      * The current logic for returns:
@@ -1458,12 +1557,21 @@ int domain_context_mapping_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !seg && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, MAP_ME_PHANTOM_FUNC);
+        rc = me_wifi_quirk(domain, bus, devfn, mode);
 
     if ( rc )
-        domain_context_unmap_one(domain, iommu, bus, devfn);
+    {
+        if ( !prev_dom )
+            domain_context_unmap_one(domain, iommu, bus, devfn);
+        else if ( prev_dom != domain ) /* Avoid infinite recursion. */
+            domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
+                                       mode & MAP_WITH_RMRR);
+    }
 
-    return rc;
+    if ( prev_dom )
+        rcu_unlock_domain(prev_dom);
+
+    return rc ?: pdev && prev_dom;
 }
 
 static int domain_context_unmap(struct domain *d, uint8_t devfn,
@@ -1473,8 +1581,11 @@ static int domain_context_mapping(struct
                                   struct pci_dev *pdev)
 {
     struct acpi_drhd_unit *drhd;
+    const struct acpi_rmrr_unit *rmrr;
     int ret = 0;
-    u8 seg = pdev->seg, bus = pdev->bus, secbus;
+    unsigned int i, mode = 0;
+    uint16_t seg = pdev->seg, bdf;
+    uint8_t bus = pdev->bus, secbus;
 
     drhd = acpi_find_matched_drhd_unit(pdev);
     if ( !drhd )
@@ -1493,8 +1604,29 @@ static int domain_context_mapping(struct
 
     ASSERT(pcidevs_locked());
 
+    for_each_rmrr_device( rmrr, bdf, i )
+    {
+        if ( rmrr->segment != pdev->seg || bdf != pdev->sbdf.bdf )
+            continue;
+
+        mode |= MAP_WITH_RMRR;
+        break;
+    }
+
+    if ( domain != pdev->domain )
+    {
+        if ( pdev->domain->is_dying )
+            mode |= MAP_OWNER_DYING;
+        else if ( drhd &&
+                  !any_pdev_behind_iommu(pdev->domain, pdev, drhd->iommu) &&
+                  !pdev->phantom_stride )
+            mode |= MAP_SINGLE_DEVICE;
+    }
+
     switch ( pdev->type )
     {
+        bool prev_present;
+
     case DEV_TYPE_PCI_HOST_BRIDGE:
         if ( iommu_debug )
             printk(VTDPREFIX "d%d:Hostbridge: skip %04x:%02x:%02x.%u map\n",
@@ -1515,7 +1647,9 @@ static int domain_context_mapping(struct
                    domain->domain_id, seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev);
+                                         pdev, mode);
+        if ( ret > 0 )
+            ret = 0;
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             enable_ats_device(pdev, &drhd->iommu->ats_devices);
 
@@ -1528,9 +1662,10 @@ static int domain_context_mapping(struct
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev);
-        if ( ret )
+                                         pdev, mode);
+        if ( ret < 0 )
             break;
+        prev_present = ret;
 
         if ( (ret = find_upstream_bridge(seg, &bus, &devfn, &secbus)) < 1 )
         {
@@ -1538,6 +1673,15 @@ static int domain_context_mapping(struct
                 break;
             ret = -ENXIO;
         }
+        /*
+         * Strictly speaking if the device is the only one behind this bridge
+         * and the only one with this (secbus,0,0) tuple, it could be allowed
+         * to be re-assigned regardless of RMRR presence.  But let's deal with
+         * that case only if it is actually found in the wild.
+         */
+        else if ( prev_present && (mode & MAP_WITH_RMRR) &&
+                  domain != pdev->domain )
+            ret = -EOPNOTSUPP;
 
         /*
          * Mapping a bridge should, if anything, pass the struct pci_dev of
@@ -1546,7 +1690,7 @@ static int domain_context_mapping(struct
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                             NULL);
+                                             NULL, mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1561,10 +1705,15 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL);
+                                             NULL, mode);
 
         if ( ret )
-            domain_context_unmap(domain, devfn, pdev);
+        {
+            if ( !prev_present )
+                domain_context_unmap(domain, devfn, pdev);
+            else if ( pdev->domain != domain ) /* Avoid infinite recursion. */
+                domain_context_mapping(pdev->domain, devfn, pdev);
+        }
 
         break;
 
@@ -2331,9 +2480,8 @@ static int reassign_device_ownership(
 {
     int ret;
 
-    ret = domain_context_unmap(source, devfn, pdev);
-    if ( ret )
-        return ret;
+    if ( !has_arch_pdevs(target) )
+        vmx_pi_hooks_assign(target);
 
     /*
      * Devices assigned to untrusted domains (here assumed to be any domU)
@@ -2343,6 +2491,31 @@ static int reassign_device_ownership(
     if ( (target != hardware_domain) && !iommu_intremap )
         untrusted_msi = true;
 
+    ret = domain_context_mapping(target, devfn, pdev);
+    if ( ret )
+    {
+        if ( !has_arch_pdevs(target) )
+            vmx_pi_hooks_deassign(target);
+        return ret;
+    }
+
+    if ( pdev->devfn == devfn )
+    {
+        const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
+
+        if ( drhd )
+            check_cleanup_domid_map(source, pdev, drhd->iommu);
+    }
+
+    if ( devfn == pdev->devfn && pdev->domain != target )
+    {
+        list_move(&pdev->domain_list, &target->pdev_list);
+        pdev->domain = target;
+    }
+
+    if ( !has_arch_pdevs(source) )
+        vmx_pi_hooks_deassign(source);
+
     /*
      * If the device belongs to the hardware domain, and it has RMRR, don't
      * remove it from the hardware domain, because BIOS may use RMRR at
@@ -2371,34 +2544,7 @@ static int reassign_device_ownership(
             }
     }
 
-    if ( devfn == pdev->devfn && pdev->domain != dom_io )
-    {
-        list_move(&pdev->domain_list, &dom_io->pdev_list);
-        pdev->domain = dom_io;
-    }
-
-    if ( !has_arch_pdevs(source) )
-        vmx_pi_hooks_deassign(source);
-
-    if ( !has_arch_pdevs(target) )
-        vmx_pi_hooks_assign(target);
-
-    ret = domain_context_mapping(target, devfn, pdev);
-    if ( ret )
-    {
-        if ( !has_arch_pdevs(target) )
-            vmx_pi_hooks_deassign(target);
-
-        return ret;
-    }
-
-    if ( devfn == pdev->devfn && pdev->domain != target )
-    {
-        list_move(&pdev->domain_list, &target->pdev_list);
-        pdev->domain = target;
-    }
-
-    return ret;
+    return 0;
 }
 
 static int intel_iommu_assign_device(
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -202,8 +202,12 @@ struct root_entry {
     do {(root).val |= ((value) & PAGE_MASK_4K);} while(0)
 
 struct context_entry {
-    u64 lo;
-    u64 hi;
+    union {
+        struct {
+            uint64_t lo, hi;
+        };
+        __uint128_t full;
+    };
 };
 #define ROOT_ENTRY_NR (PAGE_SIZE_4K/sizeof(struct root_entry))
 #define context_present(c) ((c).lo & 1)
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -343,7 +343,8 @@ void __init platform_quirks_init(void)
  */
 
 static int __must_check map_me_phantom_function(struct domain *domain,
-                                                u32 dev, int map)
+                                                unsigned int dev,
+                                                unsigned int mode)
 {
     struct acpi_drhd_unit *drhd;
     struct pci_dev *pdev;
@@ -354,9 +355,9 @@ static int __must_check map_me_phantom_f
     drhd = acpi_find_matched_drhd_unit(pdev);
 
     /* map or unmap ME phantom function */
-    if ( map )
+    if ( !(mode & UNMAP_ME_PHANTOM_FUNC) )
         rc = domain_context_mapping_one(domain, drhd->iommu, 0,
-                                        PCI_DEVFN(dev, 7), NULL);
+                                        PCI_DEVFN(dev, 7), NULL, mode);
     else
         rc = domain_context_unmap_one(domain, drhd->iommu, 0,
                                       PCI_DEVFN(dev, 7));
@@ -364,7 +365,8 @@ static int __must_check map_me_phantom_f
     return rc;
 }
 
-int me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map)
+int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn,
+                  unsigned int mode)
 {
     u32 id;
     int rc = 0;
@@ -388,7 +390,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x423b8086:
             case 0x423c8086:
             case 0x423d8086:
-                rc = map_me_phantom_function(domain, 3, map);
+                rc = map_me_phantom_function(domain, 3, mode);
                 break;
             default:
                 break;
@@ -414,7 +416,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x42388086:        /* Puma Peak */
             case 0x422b8086:
             case 0x422c8086:
-                rc = map_me_phantom_function(domain, 22, map);
+                rc = map_me_phantom_function(domain, 22, mode);
                 break;
             default:
                 break;
--- a/xen/drivers/passthrough/vtd/vtd.h
+++ b/xen/drivers/passthrough/vtd/vtd.h
@@ -22,8 +22,14 @@
 
 #include <xen/iommu.h>
 
-#define MAP_ME_PHANTOM_FUNC      1
-#define UNMAP_ME_PHANTOM_FUNC    0
+/*
+ * Values for domain_context_mapping_one()'s and me_wifi_quirk()'s "mode"
+ * parameters.
+ */
+#define MAP_WITH_RMRR         (1u << 0)
+#define MAP_OWNER_DYING       (1u << 1)
+#define MAP_SINGLE_DEVICE     (1u << 2)
+#define UNMAP_ME_PHANTOM_FUNC (1u << 3)
 
 /* Allow for both IOAPIC and IOSAPIC. */
 #define IO_xAPIC_route_entry IO_APIC_route_entry
From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: re-assign devices directly

Because it is unspecified how/when these memory ranges may get
accessed, devices with unity map ranges may not be left disconnected
from their unity mappings (as long as it's not certain that the device
has been fully quiesced).  Hence, rather than tearing down the old root
page table pointer and then establishing the new one, re-assignment
needs to be done in a single step.

This is CVE-2022-26360 / part of XSA-400.

Reported-by: Roger Pau Monné <roger.pau@citrix.com>

Similarly, the scratch-page mode of quarantining relies on page tables
being continuously wired up.

To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any unity map
ranges.  The main difference is when it comes to updating DTEs, which need
to be atomic when there are unity mappings. Yet atomicity can only be
achieved with CMPXCHG16B, the availability of which we can't take for granted.
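
As a standalone illustration (not Xen code; the entry layout, helper
name, and have_cx16 flag are purely hypothetical), the following C
sketch models the two update strategies described above: one 16-byte
compare-exchange when the CPU supports it, or two 64-bit stores with
the word carrying the ID-like field written last.  It assumes a
GCC/Clang x86-64 target built with -mcx16 (16-byte atomics may
otherwise be routed through libatomic).

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

union entry {
    struct { uint64_t lo, hi; };     /* "hi" holds the ID-like field */
    unsigned __int128 full;
};

/* Returns true if the update was fully atomic. */
static bool update_entry(union entry *e, uint64_t lo, uint64_t hi,
                         bool have_cx16)
{
    union entry want = { .lo = lo, .hi = hi };

    if ( have_cx16 )
    {
        unsigned __int128 old = e->full;

        /* One 16-byte compare-exchange: no transient mixed state. */
        __atomic_compare_exchange_n(&e->full, &old, want.full, false,
                                    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
        return true;
    }

    /* Best effort: two 64-bit stores, the ID-carrying word last. */
    __atomic_store_n(&e->lo, want.lo, __ATOMIC_RELEASE);
    __atomic_store_n(&e->hi, want.hi, __ATOMIC_RELEASE);
    return false;
}

int main(void)
{
    union entry e = { .lo = 1, .hi = 2 };
    bool atomic = update_entry(&e, 3, 4, true);

    printf("atomic=%d lo=%llu hi=%llu\n", atomic,
           (unsigned long long)e.lo, (unsigned long long)e.hi);
    return 0;
}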

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
@@ -79,9 +79,13 @@ void amd_iommu_set_intremap_table(struct
                                   const void *ptr,
                                   const struct amd_iommu *iommu,
                                   bool valid);
-void amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
-				   uint64_t root_ptr, uint16_t domain_id,
-				   uint8_t paging_mode, bool valid);
+#define SET_ROOT_VALID          (1u << 0)
+#define SET_ROOT_WITH_UNITY_MAP (1u << 1)
+int __must_check amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
+                                               uint64_t root_ptr,
+                                               uint16_t domain_id,
+                                               uint8_t paging_mode,
+                                               unsigned int flags);
 void iommu_dte_add_device_entry(struct amd_iommu_dte *dte,
                                 const struct ivrs_mappings *ivrs_dev);
 void iommu_dte_set_guest_cr3(struct amd_iommu_dte *dte, uint16_t dom_id,
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -103,10 +103,69 @@ static unsigned int set_iommu_pte_presen
     return flush_flags;
 }
 
-void amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
-                                   uint64_t root_ptr, uint16_t domain_id,
-                                   uint8_t paging_mode, bool valid)
+/*
+ * This function returns
+ * - -errno for errors,
+ * - 0 for a successful update, atomic when necessary
+ * - 1 for a successful but non-atomic update, which may need to be warned
+ *   about by the caller.
+ */
+int amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
+                                  uint64_t root_ptr, uint16_t domain_id,
+                                  uint8_t paging_mode, unsigned int flags)
 {
+    bool valid = flags & SET_ROOT_VALID;
+
+    if ( dte->v && dte->tv &&
+         (cpu_has_cx16 || (flags & SET_ROOT_WITH_UNITY_MAP)) )
+    {
+        union {
+            struct amd_iommu_dte dte;
+            uint64_t raw64[4];
+            __uint128_t raw128[2];
+        } ldte = { .dte = *dte };
+        __uint128_t old = ldte.raw128[0];
+        int ret = 0;
+
+        ldte.dte.domain_id = domain_id;
+        ldte.dte.pt_root = paddr_to_pfn(root_ptr);
+        ldte.dte.iw = true;
+        ldte.dte.ir = true;
+        ldte.dte.paging_mode = paging_mode;
+        ldte.dte.v = valid;
+
+        if ( cpu_has_cx16 )
+        {
+            __uint128_t res = cmpxchg16b(dte, &old, &ldte.raw128[0]);
+
+            /*
+             * Hardware does not update the DTE behind our backs, so the
+             * return value should match "old".
+             */
+            if ( res != old )
+            {
+                printk(XENLOG_ERR
+                       "Dom%d: unexpected DTE %016lx_%016lx (expected %016lx_%016lx)\n",
+                       domain_id,
+                       (uint64_t)(res >> 64), (uint64_t)res,
+                       (uint64_t)(old >> 64), (uint64_t)old);
+                ret = -EILSEQ;
+            }
+        }
+        else /* Best effort, updating domain_id last. */
+        {
+            uint64_t *ptr = (void *)dte;
+
+            write_atomic(ptr + 0, ldte.raw64[0]);
+            /* No barrier should be needed between these two. */
+            write_atomic(ptr + 1, ldte.raw64[1]);
+
+            ret = 1;
+        }
+
+        return ret;
+    }
+
     if ( valid || dte->v )
     {
         dte->tv = false;
@@ -121,6 +180,8 @@ void amd_iommu_set_root_page_table(struc
     smp_wmb();
     dte->tv = true;
     dte->v = valid;
+
+    return 0;
 }
 
 void amd_iommu_set_intremap_table(
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -85,40 +85,81 @@ int get_dma_requestor_id(uint16_t seg, u
     return req_id;
 }
 
-static void amd_iommu_setup_domain_device(
+static int __must_check allocate_domain_resources(struct domain_iommu *hd)
+{
+    int rc;
+
+    spin_lock(&hd->arch.mapping_lock);
+    rc = amd_iommu_alloc_root(hd);
+    spin_unlock(&hd->arch.mapping_lock);
+
+    return rc;
+}
+
+static bool any_pdev_behind_iommu(const struct domain *d,
+                                  const struct pci_dev *exclude,
+                                  const struct amd_iommu *iommu)
+{
+    const struct pci_dev *pdev;
+
+    for_each_pdev ( d, pdev )
+    {
+        if ( pdev == exclude )
+            continue;
+
+        if ( find_iommu_for_device(pdev->seg, pdev->sbdf.bdf) == iommu )
+            return true;
+    }
+
+    return false;
+}
+
+static int __must_check amd_iommu_setup_domain_device(
     struct domain *domain, struct amd_iommu *iommu,
     uint8_t devfn, struct pci_dev *pdev)
 {
     struct amd_iommu_dte *table, *dte;
     unsigned long flags;
-    int req_id, valid = 1;
+    unsigned int req_id, sr_flags;
+    int rc;
     u8 bus = pdev->bus;
-    const struct domain_iommu *hd = dom_iommu(domain);
+    struct domain_iommu *hd = dom_iommu(domain);
+    const struct ivrs_mappings *ivrs_dev;
 
-    BUG_ON( !hd->arch.root_table || !hd->arch.paging_mode ||
-            !iommu->dev_table.buffer );
+    BUG_ON(!hd->arch.paging_mode || !iommu->dev_table.buffer);
 
-    if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
-        valid = 0;
+    rc = allocate_domain_resources(hd);
+    if ( rc )
+        return rc;
+
+    req_id = get_dma_requestor_id(iommu->seg, pdev->sbdf.bdf);
+    ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
+    sr_flags = (iommu_hwdom_passthrough && is_hardware_domain(domain)
+                ? 0 : SET_ROOT_VALID)
+               | (ivrs_dev->unity_map ? SET_ROOT_WITH_UNITY_MAP : 0);
 
     /* get device-table entry */
     req_id = get_dma_requestor_id(iommu->seg, PCI_BDF2(bus, devfn));
     table = iommu->dev_table.buffer;
     dte = &table[req_id];
+    ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
 
     spin_lock_irqsave(&iommu->lock, flags);
 
     if ( !dte->v || !dte->tv )
     {
-        const struct ivrs_mappings *ivrs_dev;
-
         /* bind DTE to domain page-tables */
-        amd_iommu_set_root_page_table(
-            dte, page_to_maddr(hd->arch.root_table), domain->domain_id,
-            hd->arch.paging_mode, valid);
+        rc = amd_iommu_set_root_page_table(
+                 dte, page_to_maddr(hd->arch.root_table),
+                 domain->domain_id, hd->arch.paging_mode, sr_flags);
+        if ( rc )
+        {
+            ASSERT(rc < 0);
+            spin_unlock_irqrestore(&iommu->lock, flags);
+            return rc;
+        }
 
         /* Undo what amd_iommu_disable_domain_device() may have done. */
-        ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
         if ( dte->it_root )
         {
             dte->int_ctl = IOMMU_DEV_TABLE_INT_CONTROL_TRANSLATED;
@@ -133,17 +174,74 @@ static void amd_iommu_setup_domain_devic
             dte->i = ats_enabled;
 
         amd_iommu_flush_device(iommu, req_id);
+    }
+    else if ( dte->pt_root != mfn_x(page_to_mfn(hd->arch.root_table)) )
+    {
+        /*
+         * Strictly speaking if the device is the only one with this requestor
+         * ID, it could be allowed to be re-assigned regardless of unity map
+         * presence.  But let's deal with that case only if it is actually
+         * found in the wild.
+         */
+        if ( req_id != PCI_BDF2(bus, devfn) &&
+             (sr_flags & SET_ROOT_WITH_UNITY_MAP) )
+            rc = -EOPNOTSUPP;
+        else
+            rc = amd_iommu_set_root_page_table(
+                     dte, page_to_maddr(hd->arch.root_table),
+                     domain->domain_id, hd->arch.paging_mode, sr_flags);
+        if ( rc < 0 )
+        {
+            spin_unlock_irqrestore(&iommu->lock, flags);
+            return rc;
+        }
+        if ( rc &&
+             domain != pdev->domain &&
+             /*
+              * By non-atomically updating the DTE's domain ID field last,
+              * during a short window in time TLB entries with the old domain
+              * ID but the new page tables may have been inserted.  This could
+              * affect I/O of other devices using this same (old) domain ID.
+              * Such updating therefore is not a problem if this was the only
+              * device associated with the old domain ID.  Diverting I/O of any
+              * of a dying domain's devices to the quarantine page tables is
+              * intended anyway.
+              */
+             !pdev->domain->is_dying &&
+             (any_pdev_behind_iommu(pdev->domain, pdev, iommu) ||
+              pdev->phantom_stride) )
+            printk(" %04x:%02x:%02x.%u: reassignment may cause %pd data corruption\n",
+                   pdev->seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                   pdev->domain);
+
+        /*
+         * Check remaining settings are still in place from an earlier call
+         * here. They're all independent of the domain, so should not have
+         * changed.
+         */
+        if ( dte->it_root )
+            ASSERT(dte->int_ctl == IOMMU_DEV_TABLE_INT_CONTROL_TRANSLATED);
+        ASSERT(dte->iv == iommu_intremap);
+        ASSERT(dte->ex == ivrs_dev->dte_allow_exclusion);
+        ASSERT(dte->sys_mgt == MASK_EXTR(ivrs_dev->device_flags,
+                                         ACPI_IVHD_SYSTEM_MGMT));
 
-        AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
-                        "root table = %#"PRIx64", "
-                        "domain = %d, paging mode = %d\n",
-                        req_id, pdev->type,
-                        page_to_maddr(hd->arch.root_table),
-                        domain->domain_id, hd->arch.paging_mode);
+        if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
+             iommu_has_cap(iommu, PCI_CAP_IOTLB_SHIFT) )
+            ASSERT(dte->i == ats_enabled);
+
+        amd_iommu_flush_device(iommu, req_id);
     }
 
     spin_unlock_irqrestore(&iommu->lock, flags);
 
+    AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
+                    "root table = %#"PRIx64", "
+                    "domain = %d, paging mode = %d\n",
+                    req_id, pdev->type,
+                    page_to_maddr(hd->arch.root_table),
+                    domain->domain_id, hd->arch.paging_mode);
+
     ASSERT(pcidevs_locked());
 
     if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
@@ -154,6 +252,8 @@ static void amd_iommu_setup_domain_devic
 
         amd_iommu_flush_iotlb(devfn, pdev, INV_IOMMU_ALL_PAGES_ADDRESS, 0);
     }
+
+    return 0;
 }
 
 int __init acpi_ivrs_init(void)
@@ -223,17 +323,6 @@ int amd_iommu_alloc_root(struct domain_i
     return 0;
 }
 
-static int __must_check allocate_domain_resources(struct domain_iommu *hd)
-{
-    int rc;
-
-    spin_lock(&hd->arch.mapping_lock);
-    rc = amd_iommu_alloc_root(hd);
-    spin_unlock(&hd->arch.mapping_lock);
-
-    return rc;
-}
-
 int __read_mostly amd_iommu_min_paging_mode = 1;
 
 static int amd_iommu_domain_init(struct domain *d)
@@ -333,7 +422,6 @@ static int reassign_device(struct domain
 {
     struct amd_iommu *iommu;
     int bdf, rc;
-    struct domain_iommu *t = dom_iommu(target);
     const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
 
     bdf = PCI_BDF2(pdev->bus, pdev->devfn);
@@ -347,7 +435,15 @@ static int reassign_device(struct domain
         return -ENODEV;
     }
 
-    amd_iommu_disable_domain_device(source, iommu, devfn, pdev);
+    rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
+    if ( rc )
+        return rc;
+
+    if ( devfn == pdev->devfn && pdev->domain != target )
+    {
+        list_move(&pdev->domain_list, &target->pdev_list);
+        pdev->domain = target;
+    }
 
     /*
      * If the device belongs to the hardware domain, and it has a unity mapping,
@@ -363,27 +459,10 @@ static int reassign_device(struct domain
             return rc;
     }
 
-    if ( devfn == pdev->devfn && pdev->domain != dom_io )
-    {
-        list_move(&pdev->domain_list, &dom_io->pdev_list);
-        pdev->domain = dom_io;
-    }
-
-    rc = allocate_domain_resources(t);
-    if ( rc )
-        return rc;
-
-    amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
     AMD_IOMMU_DEBUG("Re-assign %04x:%02x:%02x.%u from dom%d to dom%d\n",
                     pdev->seg, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                     source->domain_id, target->domain_id);
 
-    if ( devfn == pdev->devfn && pdev->domain != target )
-    {
-        list_move(&pdev->domain_list, &target->pdev_list);
-        pdev->domain = target;
-    }
-
     return 0;
 }
 
@@ -547,8 +626,7 @@ static int amd_iommu_add_device(u8 devfn
         spin_unlock_irqrestore(&iommu->lock, flags);
     }
 
-    amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
-    return 0;
+    return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
 }
 
 static int amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: prepare for per-device quarantine page tables (part I)

Arrange for domain ID and page table root to be passed around, the latter in
particular to domain_pgd_maddr() such that taking it from the per-domain
fields can be overridden.

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -86,9 +86,10 @@ void *map_vtd_domain_page(u64 maddr);
 void unmap_vtd_domain_page(void *va);
 int domain_context_mapping_one(struct domain *domain, struct vtd_iommu *iommu,
                                uint8_t bus, uint8_t devfn,
-                               const struct pci_dev *pdev, unsigned int mode);
+                               const struct pci_dev *pdev, domid_t domid,
+                               paddr_t pgd_maddr, unsigned int mode);
 int domain_context_unmap_one(struct domain *domain, struct vtd_iommu *iommu,
-                             u8 bus, u8 devfn);
+                             uint8_t bus, uint8_t devfn, domid_t domid);
 int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
 
 unsigned int io_apic_read_remap_rte(unsigned int apic, unsigned int reg);
@@ -107,7 +108,8 @@ void platform_quirks_init(void);
 void vtd_ops_preamble_quirk(struct vtd_iommu *iommu);
 void vtd_ops_postamble_quirk(struct vtd_iommu *iommu);
 int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus,
-                               uint8_t devfn, unsigned int mode);
+                               uint8_t devfn, domid_t domid, paddr_t pgd_maddr,
+                               unsigned int mode);
 void pci_vtd_quirk(const struct pci_dev *);
 void quirk_iommu_caps(struct vtd_iommu *iommu);
 
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1364,12 +1364,12 @@ int domain_context_mapping_one(
     struct domain *domain,
     struct vtd_iommu *iommu,
     uint8_t bus, uint8_t devfn, const struct pci_dev *pdev,
-    unsigned int mode)
+    domid_t domid, paddr_t pgd_maddr, unsigned int mode)
 {
     struct domain_iommu *hd = dom_iommu(domain);
     struct context_entry *context, *context_entries, lctxt;
     __uint128_t old;
-    u64 maddr, pgd_maddr;
+    uint64_t maddr;
     uint16_t seg = iommu->drhd->segment, prev_did = 0;
     struct domain *prev_dom = NULL;
     int agaw, rc, ret;
@@ -1410,10 +1410,12 @@ int domain_context_mapping_one(
     }
     else
     {
+        paddr_t root = pgd_maddr;
+
         spin_lock(&hd->arch.mapping_lock);
 
         /* Ensure we have pagetables allocated down to leaf PTE. */
-        if ( hd->arch.pgd_maddr == 0 )
+        if ( !root )
         {
             addr_to_dma_page_maddr(domain, 0, 1);
             if ( hd->arch.pgd_maddr == 0 )
@@ -1426,22 +1428,24 @@ int domain_context_mapping_one(
                     rcu_unlock_domain(prev_dom);
                 return -ENOMEM;
             }
+
+            root = hd->arch.pgd_maddr;
         }
 
         /* Skip top levels of page tables for 2- and 3-level DRHDs. */
-        pgd_maddr = hd->arch.pgd_maddr;
         for ( agaw = level_to_agaw(4);
               agaw != level_to_agaw(iommu->nr_pt_levels);
               agaw-- )
         {
-            struct dma_pte *p = map_vtd_domain_page(pgd_maddr);
-            pgd_maddr = dma_pte_addr(*p);
+            struct dma_pte *p = map_vtd_domain_page(root);
+
+            root = dma_pte_addr(*p);
             unmap_vtd_domain_page(p);
-            if ( pgd_maddr == 0 )
+            if ( !root )
                 goto nomem;
         }
 
-        context_set_address_root(lctxt, pgd_maddr);
+        context_set_address_root(lctxt, root);
         if ( ats_enabled && ecap_dev_iotlb(iommu->ecap) )
             context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB);
         else
@@ -1557,15 +1561,21 @@ int domain_context_mapping_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !seg && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, mode);
+        rc = me_wifi_quirk(domain, bus, devfn, domid, pgd_maddr, mode);
 
     if ( rc )
     {
         if ( !prev_dom )
-            domain_context_unmap_one(domain, iommu, bus, devfn);
+            domain_context_unmap_one(domain, iommu, bus, devfn,
+                                     domain->domain_id);
         else if ( prev_dom != domain ) /* Avoid infinite recursion. */
+        {
+            hd = dom_iommu(prev_dom);
             domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
+                                       domain->domain_id,
+                                       hd->arch.pgd_maddr,
                                        mode & MAP_WITH_RMRR);
+        }
     }
 
     if ( prev_dom )
@@ -1582,6 +1592,7 @@ static int domain_context_mapping(struct
 {
     struct acpi_drhd_unit *drhd;
     const struct acpi_rmrr_unit *rmrr;
+    paddr_t pgd_maddr = dom_iommu(domain)->arch.pgd_maddr;
     int ret = 0;
     unsigned int i, mode = 0;
     uint16_t seg = pdev->seg, bdf;
@@ -1647,7 +1658,8 @@ static int domain_context_mapping(struct
                    domain->domain_id, seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, mode);
+                                         pdev, domain->domain_id, pgd_maddr,
+                                         mode);
         if ( ret > 0 )
             ret = 0;
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
@@ -1662,7 +1674,8 @@ static int domain_context_mapping(struct
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, mode);
+                                         pdev, domain->domain_id, pgd_maddr,
+                                         mode);
         if ( ret < 0 )
             break;
         prev_present = ret;
@@ -1690,7 +1703,8 @@ static int domain_context_mapping(struct
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                             NULL, mode);
+                                             NULL, domain->domain_id, pgd_maddr,
+                                             mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1705,7 +1719,8 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL, mode);
+                                             NULL, domain->domain_id, pgd_maddr,
+                                             mode);
 
         if ( ret )
         {
@@ -1734,7 +1749,7 @@ static int domain_context_mapping(struct
 int domain_context_unmap_one(
     struct domain *domain,
     struct vtd_iommu *iommu,
-    u8 bus, u8 devfn)
+    uint8_t bus, uint8_t devfn, domid_t domid)
 {
     struct context_entry *context, *context_entries;
     u64 maddr;
@@ -1792,7 +1807,7 @@ int domain_context_unmap_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !iommu->drhd->segment && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, UNMAP_ME_PHANTOM_FUNC);
+        rc = me_wifi_quirk(domain, bus, devfn, domid, 0, UNMAP_ME_PHANTOM_FUNC);
 
     if ( rc && !is_hardware_domain(domain) && domain != dom_io )
     {
@@ -1844,7 +1859,8 @@ static int domain_context_unmap(struct d
             printk(VTDPREFIX "d%d:PCIe: unmap %04x:%02x:%02x.%u\n",
                    domain->domain_id, seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+                                       domain->domain_id);
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             disable_ats_device(pdev);
 
@@ -1854,7 +1870,8 @@ static int domain_context_unmap(struct d
         if ( iommu_debug )
             printk(VTDPREFIX "d%d:PCI: unmap %04x:%02x:%02x.%u\n",
                    domain->domain_id, seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+                                       domain->domain_id);
         if ( ret )
             break;
 
@@ -1880,12 +1897,15 @@ static int domain_context_unmap(struct d
         /* PCIe to PCI/PCIx bridge */
         if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
         {
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);
+            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                           domain->domain_id);
             if ( !ret )
-                ret = domain_context_unmap_one(domain, iommu, secbus, 0);
+                ret = domain_context_unmap_one(domain, iommu, secbus, 0,
+                                               domain->domain_id);
         }
         else /* Legacy PCI bridge */
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);
+            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                           domain->domain_id);
 
         break;
 
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -344,6 +344,8 @@ void __init platform_quirks_init(void)
 
 static int __must_check map_me_phantom_function(struct domain *domain,
                                                 unsigned int dev,
+                                                domid_t domid,
+                                                paddr_t pgd_maddr,
                                                 unsigned int mode)
 {
     struct acpi_drhd_unit *drhd;
@@ -357,16 +359,17 @@ static int __must_check map_me_phantom_f
     /* map or unmap ME phantom function */
     if ( !(mode & UNMAP_ME_PHANTOM_FUNC) )
         rc = domain_context_mapping_one(domain, drhd->iommu, 0,
-                                        PCI_DEVFN(dev, 7), NULL, mode);
+                                        PCI_DEVFN(dev, 7), NULL,
+                                        domid, pgd_maddr, mode);
     else
         rc = domain_context_unmap_one(domain, drhd->iommu, 0,
-                                      PCI_DEVFN(dev, 7));
+                                      PCI_DEVFN(dev, 7), domid);
 
     return rc;
 }
 
 int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn,
-                  unsigned int mode)
+                  domid_t domid, paddr_t pgd_maddr, unsigned int mode)
 {
     u32 id;
     int rc = 0;
@@ -390,7 +393,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x423b8086:
             case 0x423c8086:
             case 0x423d8086:
-                rc = map_me_phantom_function(domain, 3, mode);
+                rc = map_me_phantom_function(domain, 3, domid, pgd_maddr, mode);
                 break;
             default:
                 break;
@@ -416,7 +419,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x42388086:        /* Puma Peak */
             case 0x422b8086:
             case 0x422c8086:
-                rc = map_me_phantom_function(domain, 22, mode);
+                rc = map_me_phantom_function(domain, 22, domid, pgd_maddr, mode);
                 break;
             default:
                 break;
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: prepare for per-device quarantine page tables (part II)

Replace the passing of struct domain * by domid_t in preparation for
per-device quarantine page tables also requiring per-device pseudo
domain IDs, which aren't going to be associated with any struct domain
instances.

No functional change intended (except for slightly adjusted log message
text).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -52,8 +52,8 @@ static struct tasklet vtd_fault_tasklet;
 static int setup_hwdom_device(u8 devfn, struct pci_dev *);
 static void setup_hwdom_rmrr(struct domain *d);
 
-static int domain_iommu_domid(struct domain *d,
-                              struct vtd_iommu *iommu)
+static int get_iommu_did(domid_t domid, const struct vtd_iommu *iommu,
+                         bool warn)
 {
     unsigned long nr_dom, i;
 
@@ -61,16 +61,16 @@ static int domain_iommu_domid(struct dom
     i = find_first_bit(iommu->domid_bitmap, nr_dom);
     while ( i < nr_dom )
     {
-        if ( iommu->domid_map[i] == d->domain_id )
+        if ( iommu->domid_map[i] == domid )
             return i;
 
         i = find_next_bit(iommu->domid_bitmap, nr_dom, i+1);
     }
 
-    if ( !d->is_dying )
+    if ( warn )
         dprintk(XENLOG_ERR VTDPREFIX,
-                "Cannot get valid iommu %u domid: %pd\n",
-                iommu->index, d);
+                "No valid iommu %u domid for Dom%d\n",
+                iommu->index, domid);
 
     return -1;
 }
@@ -78,8 +78,7 @@ static int domain_iommu_domid(struct dom
 #define DID_FIELD_WIDTH 16
 #define DID_HIGH_OFFSET 8
 static int context_set_domain_id(struct context_entry *context,
-                                 struct domain *d,
-                                 struct vtd_iommu *iommu)
+                                 domid_t domid, struct vtd_iommu *iommu)
 {
     unsigned long nr_dom, i;
     int found = 0;
@@ -90,7 +89,7 @@ static int context_set_domain_id(struct
     i = find_first_bit(iommu->domid_bitmap, nr_dom);
     while ( i < nr_dom )
     {
-        if ( iommu->domid_map[i] == d->domain_id )
+        if ( iommu->domid_map[i] == domid )
         {
             found = 1;
             break;
@@ -106,7 +105,7 @@ static int context_set_domain_id(struct
             dprintk(XENLOG_ERR VTDPREFIX, "IOMMU: no free domain ids\n");
             return -EFAULT;
         }
-        iommu->domid_map[i] = d->domain_id;
+        iommu->domid_map[i] = domid;
     }
 
     set_bit(i, iommu->domid_bitmap);
@@ -115,9 +114,9 @@ static int context_set_domain_id(struct
     return 0;
 }
 
-static void cleanup_domid_map(struct domain *domain, struct vtd_iommu *iommu)
+static void cleanup_domid_map(domid_t domid, struct vtd_iommu *iommu)
 {
-    int iommu_domid = domain_iommu_domid(domain, iommu);
+    int iommu_domid = get_iommu_did(domid, iommu, false);
 
     if ( iommu_domid >= 0 )
     {
@@ -173,7 +172,7 @@ static void check_cleanup_domid_map(stru
     if ( !found )
     {
         clear_bit(iommu->index, &dom_iommu(d)->arch.iommu_bitmap);
-        cleanup_domid_map(d, iommu);
+        cleanup_domid_map(d->domain_id, iommu);
     }
 }
 
@@ -630,7 +629,7 @@ static int __must_check iommu_flush_iotl
             continue;
 
         flush_dev_iotlb = !!find_ats_dev_drhd(iommu);
-        iommu_domid= domain_iommu_domid(d, iommu);
+        iommu_domid = get_iommu_did(d->domain_id, iommu, !d->is_dying);
         if ( iommu_domid == -1 )
             continue;
 
@@ -1454,7 +1453,7 @@ int domain_context_mapping_one(
         spin_unlock(&hd->arch.mapping_lock);
     }
 
-    rc = context_set_domain_id(&lctxt, domain, iommu);
+    rc = context_set_domain_id(&lctxt, domid, iommu);
     if ( rc )
     {
     unlock:
@@ -1774,7 +1773,7 @@ int domain_context_unmap_one(
     context_clear_entry(*context);
     iommu_sync_cache(context, sizeof(struct context_entry));
 
-    iommu_domid= domain_iommu_domid(domain, iommu);
+    iommu_domid = get_iommu_did(domid, iommu, !domain->is_dying);
     if ( iommu_domid == -1 )
     {
         spin_unlock(&iommu->lock);
@@ -1948,7 +1947,7 @@ static void iommu_domain_teardown(struct
     spin_unlock(&hd->arch.mapping_lock);
 
     for_each_drhd_unit ( drhd )
-        cleanup_domid_map(d, drhd->iommu);
+        cleanup_domid_map(d->domain_id, drhd->iommu);
 }
 
 static int __must_check intel_iommu_map_page(struct domain *d, dfn_t dfn,
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: maintain a per-device pseudo domain ID

In order to subsequently enable per-device quarantine page tables, we'll
need domain-ID-like identifiers to be inserted in the respective device
(AMD) or context (Intel) table entries alongside the per-device page
table root addresses.

Make use of "real" domain IDs occupying only half of the value range
coverable by domid_t.
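
To make the split of the ID space concrete, here is a minimal
self-contained model (not the Xen implementation; helper names, the
bitmap scan, and the sentinel value are illustrative only): real IDs
stay within [0, DOMID_MASK], while pseudo IDs are drawn from a bitmap
and encoded into the upper half of the 16-bit space, so the two ranges
cannot collide.

#include <stdint.h>
#include <stdio.h>

#define DOMID_MASK 0x7fffu
#define POOL_SIZE  (0xffffu - DOMID_MASK)  /* 32768 pseudo IDs */
#define ID_INVALID DOMID_MASK              /* sentinel for this model only */

static unsigned char bitmap[POOL_SIZE / 8];

static uint16_t pseudo_alloc(void)
{
    for ( unsigned int idx = 0; idx < POOL_SIZE; ++idx )
        if ( !(bitmap[idx / 8] & (1u << (idx % 8))) )
        {
            bitmap[idx / 8] |= 1u << (idx % 8);
            return idx | (DOMID_MASK + 1); /* land in the upper half */
        }

    return ID_INVALID;
}

static void pseudo_free(uint16_t id)
{
    if ( id != ID_INVALID )
        bitmap[(id & DOMID_MASK) / 8] &= ~(1u << ((id & DOMID_MASK) % 8));
}

int main(void)
{
    uint16_t id = pseudo_alloc();

    printf("pseudo ID %#x (real IDs end at %#x)\n",
           (unsigned int)id, DOMID_MASK);
    pseudo_free(id);
    return 0;
}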

Note that in VT-d's iommu_alloc() I didn't want to introduce new memory
leaks in case of error, but existing ones don't get plugged - that'll be
the subject of a later change.

The VT-d changes are slightly asymmetric, but this way we can avoid
assigning pseudo domain IDs to devices which would never be mapped,
while still avoiding the addition of a new parameter to
domain_context_unmap().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -130,6 +130,10 @@ int pi_update_irte(const struct pi_desc
         iommu_vcall(ops, sync_cache, addr, size);       \
 })
 
+unsigned long *iommu_init_domid(void);
+domid_t iommu_alloc_domid(unsigned long *map);
+void iommu_free_domid(domid_t domid, unsigned long *map);
+
 #endif /* !__ARCH_X86_IOMMU_H__ */
 /*
  * Local variables:
--- a/xen/include/asm-x86/pci.h
+++ b/xen/include/asm-x86/pci.h
@@ -15,6 +15,12 @@
 
 struct arch_pci_dev {
     vmask_t used_vectors;
+    /*
+     * These fields are (de)initialized under pcidevs-lock. Other uses of
+     * them don't race (de)initialization and hence don't strictly need any
+     * locking.
+     */
+    domid_t pseudo_domid;
 };
 
 int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
--- a/xen/include/asm-x86/amd-iommu.h
+++ b/xen/include/asm-x86/amd-iommu.h
@@ -94,6 +94,7 @@ struct amd_iommu {
     struct ring_buffer cmd_buffer;
     struct ring_buffer event_log;
     struct ring_buffer ppr_log;
+    unsigned long *domid_map;
 
     int exclusion_enable;
     int exclusion_allow_all;
--- a/xen/drivers/passthrough/amd/iommu_detect.c
+++ b/xen/drivers/passthrough/amd/iommu_detect.c
@@ -183,6 +183,11 @@ int __init amd_iommu_detect_one_acpi(
     if ( rt )
         goto out;
 
+    iommu->domid_map = iommu_init_domid();
+    rt = -ENOMEM;
+    if ( !iommu->domid_map )
+        goto out;
+
     rt = pci_ro_device(iommu->seg, bus, PCI_DEVFN(dev, func));
     if ( rt )
         printk(XENLOG_ERR
@@ -194,7 +199,10 @@ int __init amd_iommu_detect_one_acpi(
 
  out:
     if ( rt )
+    {
+        xfree(iommu->domid_map);
         xfree(iommu);
+    }
 
     return rt;
 }
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -563,6 +563,8 @@ static int amd_iommu_add_device(u8 devfn
     struct amd_iommu *iommu;
     u16 bdf;
     struct ivrs_mappings *ivrs_mappings;
+    bool fresh_domid = false;
+    int ret;
 
     if ( !pdev->domain )
         return -EINVAL;
@@ -626,7 +628,22 @@ static int amd_iommu_add_device(u8 devfn
         spin_unlock_irqrestore(&iommu->lock, flags);
     }
 
-    return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
+    if ( iommu_quarantine && pdev->arch.pseudo_domid == DOMID_INVALID )
+    {
+        pdev->arch.pseudo_domid = iommu_alloc_domid(iommu->domid_map);
+        if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+            return -ENOSPC;
+        fresh_domid = true;
+    }
+
+    ret = amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
+    if ( ret && fresh_domid )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
+    return ret;
 }
 
 static int amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
@@ -651,6 +668,9 @@ static int amd_iommu_remove_device(u8 de
 
     amd_iommu_disable_domain_device(pdev->domain, iommu, devfn, pdev);
 
+    iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
+    pdev->arch.pseudo_domid = DOMID_INVALID;
+
     ivrs_mappings = get_ivrs_mappings(pdev->seg);
     bdf = PCI_BDF2(pdev->bus, devfn);
     if ( amd_iommu_perdev_intremap &&
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -338,6 +338,7 @@ static struct pci_dev *alloc_pdev(struct
     *((u8*) &pdev->bus) = bus;
     *((u8*) &pdev->devfn) = devfn;
     pdev->domain = NULL;
+    pdev->arch.pseudo_domid = DOMID_INVALID;
     INIT_LIST_HEAD(&pdev->msi_list);
 
     pos = pci_find_cap_offset(pseg->nr, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
@@ -1353,9 +1354,13 @@ static int _dump_pci_devices(struct pci_
 
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
     {
-        printk("%04x:%02x:%02x.%u - %pd - node %-3d - MSIs < ",
-               pseg->nr, pdev->bus,
-               PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn), pdev->domain,
+        printk("%04x:%02x:%02x.%u - ", pseg->nr, pdev->bus,
+               PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
+        if ( pdev->domain == dom_io )
+            printk("DomIO:%x", pdev->arch.pseudo_domid);
+        else
+            printk("%pd", pdev->domain);
+        printk(" - node %-3d - MSIs < ",
                (pdev->node != NUMA_NO_NODE) ? pdev->node : -1);
         list_for_each_entry ( msi, &pdev->msi_list, list )
                printk("%d ", msi->irq);
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -22,6 +22,7 @@
 #include <xen/sched.h>
 #include <xen/xmalloc.h>
 #include <xen/domain_page.h>
+#include <xen/err.h>
 #include <xen/iocap.h>
 #include <xen/iommu.h>
 #include <xen/numa.h>
@@ -1192,7 +1193,7 @@ int __init iommu_alloc(struct acpi_drhd_
 {
     struct vtd_iommu *iommu;
     unsigned long sagaw, nr_dom;
-    int agaw;
+    int agaw, rc;
 
     if ( nr_iommus >= MAX_IOMMUS )
     {
@@ -1285,7 +1286,16 @@ int __init iommu_alloc(struct acpi_drhd_
     if ( !iommu->domid_map )
         return -ENOMEM;
 
+    iommu->pseudo_domid_map = iommu_init_domid();
+    rc = -ENOMEM;
+    if ( !iommu->pseudo_domid_map )
+        goto free;
+
     return 0;
+
+ free:
+    iommu_free(drhd);
+    return rc;
 }
 
 void __init iommu_free(struct acpi_drhd_unit *drhd)
@@ -1308,6 +1318,7 @@ void __init iommu_free(struct acpi_drhd_
 
     xfree(iommu->domid_bitmap);
     xfree(iommu->domid_map);
+    xfree(iommu->pseudo_domid_map);
 
     if ( iommu->msi.irq >= 0 )
         destroy_irq(iommu->msi.irq);
@@ -1583,8 +1594,8 @@ int domain_context_mapping_one(
     return rc ?: pdev && prev_dom;
 }
 
-static int domain_context_unmap(struct domain *d, uint8_t devfn,
-                                struct pci_dev *pdev);
+static const struct acpi_drhd_unit *domain_context_unmap(
+    struct domain *d, uint8_t devfn, struct pci_dev *pdev);
 
 static int domain_context_mapping(struct domain *domain, u8 devfn,
                                   struct pci_dev *pdev)
@@ -1592,6 +1603,7 @@ static int domain_context_mapping(struct
     struct acpi_drhd_unit *drhd;
     const struct acpi_rmrr_unit *rmrr;
     paddr_t pgd_maddr = dom_iommu(domain)->arch.pgd_maddr;
+    domid_t orig_domid = pdev->arch.pseudo_domid;
     int ret = 0;
     unsigned int i, mode = 0;
     uint16_t seg = pdev->seg, bdf;
@@ -1652,6 +1664,14 @@ static int domain_context_mapping(struct
         break;
 
     case DEV_TYPE_PCIe_ENDPOINT:
+        if ( iommu_quarantine && orig_domid == DOMID_INVALID )
+        {
+            pdev->arch.pseudo_domid =
+                iommu_alloc_domid(drhd->iommu->pseudo_domid_map);
+            if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+                return -ENOSPC;
+        }
+
         if ( iommu_debug )
             printk(VTDPREFIX "d%d:PCIe: map %04x:%02x:%02x.%u\n",
                    domain->domain_id, seg, bus,
@@ -1667,6 +1687,14 @@ static int domain_context_mapping(struct
         break;
 
     case DEV_TYPE_PCI:
+        if ( iommu_quarantine && orig_domid == DOMID_INVALID )
+        {
+            pdev->arch.pseudo_domid =
+                iommu_alloc_domid(drhd->iommu->pseudo_domid_map);
+            if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+                return -ENOSPC;
+        }
+
         if ( iommu_debug )
             printk(VTDPREFIX "d%d:PCI: map %04x:%02x:%02x.%u\n",
                    domain->domain_id, seg, bus,
@@ -1742,6 +1770,13 @@ static int domain_context_mapping(struct
     if ( !ret && devfn == pdev->devfn )
         pci_vtd_quirk(pdev);
 
+    if ( ret && drhd && orig_domid == DOMID_INVALID )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid,
+                         drhd->iommu->pseudo_domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
     return ret;
 }
 
@@ -1824,8 +1859,10 @@ int domain_context_unmap_one(
     return rc;
 }
 
-static int domain_context_unmap(struct domain *domain, u8 devfn,
-                                struct pci_dev *pdev)
+static const struct acpi_drhd_unit *domain_context_unmap(
+    struct domain *domain,
+    uint8_t devfn,
+    struct pci_dev *pdev)
 {
     struct acpi_drhd_unit *drhd;
     struct vtd_iommu *iommu;
@@ -1834,7 +1871,7 @@ static int domain_context_unmap(struct d
 
     drhd = acpi_find_matched_drhd_unit(pdev);
     if ( !drhd )
-        return -ENODEV;
+        return ERR_PTR(-ENODEV);
     iommu = drhd->iommu;
 
     switch ( pdev->type )
@@ -1845,7 +1882,7 @@ static int domain_context_unmap(struct d
                    domain->domain_id, seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
         if ( !is_hardware_domain(domain) )
-            return -EPERM;
+            return ERR_PTR(-EPERM);
         goto out;
 
     case DEV_TYPE_PCIe_BRIDGE:
@@ -1923,7 +1960,7 @@ static int domain_context_unmap(struct d
         check_cleanup_domid_map(domain, pdev, iommu);
 
 out:
-    return ret;
+    return ret ? ERR_PTR(ret) : drhd;
 }
 
 static void iommu_domain_teardown(struct domain *d)
@@ -2145,16 +2182,17 @@ static int intel_iommu_enable_device(str
 
 static int intel_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
 {
+    const struct acpi_drhd_unit *drhd;
     struct acpi_rmrr_unit *rmrr;
     u16 bdf;
-    int ret, i;
+    unsigned int i;
 
     if ( !pdev->domain )
         return -EINVAL;
 
-    ret = domain_context_unmap(pdev->domain, devfn, pdev);
-    if ( ret )
-        return ret;
+    drhd = domain_context_unmap(pdev->domain, devfn, pdev);
+    if ( IS_ERR(drhd) )
+        return PTR_ERR(drhd);
 
     for_each_rmrr_device ( rmrr, bdf, i )
     {
@@ -2171,6 +2209,13 @@ static int intel_iommu_remove_device(u8
                                rmrr->end_address, 0);
     }
 
+    if ( drhd )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid,
+                         drhd->iommu->pseudo_domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
     return 0;
 }
 
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -535,6 +535,7 @@ struct vtd_iommu {
     } flush;
 
     struct list_head ats_devices;
+    unsigned long *pseudo_domid_map; /* "pseudo" domain id bitmap */
     unsigned long *domid_bitmap;  /* domain id bitmap */
     u16 *domid_map;               /* domain id mapping array */
     uint32_t version;
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -346,6 +346,53 @@ void __hwdom_init arch_iommu_hwdom_init(
         return;
 }
 
+unsigned long *__init iommu_init_domid(void)
+{
+    if ( !iommu_quarantine )
+        return ZERO_BLOCK_PTR;
+
+    BUILD_BUG_ON(DOMID_MASK * 2U >= UINT16_MAX);
+
+    return xzalloc_array(unsigned long,
+                         BITS_TO_LONGS(UINT16_MAX - DOMID_MASK));
+}
+
+domid_t iommu_alloc_domid(unsigned long *map)
+{
+    /*
+     * This is used uniformly across all IOMMUs, such that on typical
+     * systems we wouldn't re-use the same ID very quickly (perhaps never).
+     */
+    static unsigned int start;
+    unsigned int idx = find_next_zero_bit(map, UINT16_MAX - DOMID_MASK, start);
+
+    ASSERT(pcidevs_locked());
+
+    if ( idx >= UINT16_MAX - DOMID_MASK )
+        idx = find_first_zero_bit(map, UINT16_MAX - DOMID_MASK);
+    if ( idx >= UINT16_MAX - DOMID_MASK )
+        return DOMID_INVALID;
+
+    __set_bit(idx, map);
+
+    start = idx + 1;
+
+    return idx | (DOMID_MASK + 1);
+}
+
+void iommu_free_domid(domid_t domid, unsigned long *map)
+{
+    ASSERT(pcidevs_locked());
+
+    if ( domid == DOMID_INVALID )
+        return;
+
+    ASSERT(domid > DOMID_MASK);
+
+    if ( !__test_and_clear_bit(domid & DOMID_MASK, map) )
+        BUG();
+}
+
 /*
  * Local variables:
  * mode: C
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -614,6 +614,9 @@ DEFINE_XEN_GUEST_HANDLE(mmuext_op_t);
 /* Idle domain. */
 #define DOMID_IDLE           xen_mk_uint(0x7FFF)
 
+/* Mask for valid domain id values */
+#define DOMID_MASK           xen_mk_uint(0x7FFF)
+
 #ifndef __ASSEMBLY__
 
 typedef uint16_t domid_t;
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: drop TLB flushes from quarantine_init() hooks

The page tables just created aren't hooked up yet anywhere, so there's
nothing that could be present in any TLB, and hence nothing to flush.
Dropping this flush is, at least on the VT-d side, a prereq to per-
device domain ID use when quarantining devices, as dom_io isn't going
to be assigned a DID anymore: the warning in get_iommu_did() would
otherwise trigger.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -595,8 +595,6 @@ int __init amd_iommu_quarantine_init(str
  out:
     spin_unlock(&hd->arch.mapping_lock);
 
-    amd_iommu_flush_all_pages(d);
-
     /* Pages leaked in failure case */
     return level ? -ENOMEM : 0;
 }
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2894,7 +2894,6 @@ static int __init intel_iommu_quarantine
     struct dma_pte *parent;
     unsigned int agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
     unsigned int level = agaw_to_level(agaw);
-    int rc;
 
     if ( hd->arch.pgd_maddr )
     {
@@ -2941,10 +2940,8 @@ static int __init intel_iommu_quarantine
  out:
     spin_unlock(&hd->arch.mapping_lock);
 
-    rc = iommu_flush_iotlb_all(d);
-
     /* Pages leaked in failure case */
-    return level ? -ENOMEM : rc;
+    return level ? -ENOMEM : 0;
 }
 
 const struct iommu_ops __initconstrel intel_iommu_ops = {
From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: abstract maximum number of page table levels

We will want to use the constant elsewhere.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
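
For context, the hunk below replaces a hard-coded bound of 6 in the
level-computation loop.  The following standalone sketch (simplified,
with illustrative names and constants rather than Xen's) shows how such
a loop derives the number of levels from the highest frame number that
must be mappable, each level resolving 512 entries:

#include <stdio.h>

#define ENTRIES_PER_TABLE 512u       /* 4K table of 8-byte entries */
#define MAX_PT_LEVELS     6          /* the abstracted maximum */

static int paging_levels(unsigned long long max_frames)
{
    int level = 1;

    while ( max_frames > ENTRIES_PER_TABLE )
    {
        /* Round up, then let the next level cover 512 times as much. */
        max_frames = (max_frames + ENTRIES_PER_TABLE - 1) / ENTRIES_PER_TABLE;
        if ( ++level > MAX_PT_LEVELS )
            return -1;               /* address width not representable */
    }

    return level;
}

int main(void)
{
    /* A 48-bit address width, i.e. 2^36 4K frames, needs 4 levels. */
    printf("levels: %d\n", paging_levels(1ull << 36));
    return 0;
}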

--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
@@ -193,7 +193,7 @@ static inline int amd_iommu_get_paging_m
     while ( max_frames > PTE_PER_TABLE_SIZE )
     {
         max_frames = PTE_PER_TABLE_ALIGN(max_frames) >> PTE_PER_TABLE_SHIFT;
-        if ( ++level > 6 )
+        if ( ++level > IOMMU_MAX_PT_LEVELS )
             return -ENOMEM;
     }
 
--- a/xen/include/asm-x86/hvm/svm/amd-iommu-defs.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-defs.h
@@ -110,6 +110,7 @@ struct amd_iommu_dte {
     bool tv:1;
     unsigned int :5;
     unsigned int had:2;
+#define IOMMU_MAX_PT_LEVELS 6
     unsigned int paging_mode:3;
     uint64_t pt_root:40;
     bool ppr:1;
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -260,7 +260,7 @@ static int iommu_pde_from_dfn(struct dom
     table = hd->arch.root_table;
     level = hd->arch.paging_mode;
 
-    BUG_ON( table == NULL || level < 1 || level > 6 );
+    BUG_ON( table == NULL || level < 1 || level > IOMMU_MAX_PT_LEVELS );
 
     /*
      * A frame number past what the current page tables can represent can't
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: use per-device page tables for quarantining

Because it is unspecified how/when these memory regions may be
accessed, devices with RMRRs / unity mapped regions may not be left
disconnected from the mappings of these regions (as long as it's not
certain that the device has been fully quiesced).  Hence even the page
tables used when quarantining such devices need to contain mappings of
those regions.  This implies installing page tables in the first place
even when not in scratch-page quarantining mode.

This is CVE-2022-26361 / part of XSA-400.

While for the purpose here it would be sufficient to have devices with
RMRRs / unity mapped regions use per-device page tables, extend this to
all devices (in scratch-page quarantining mode). This allows the leaf
pages to be mapped r/w, thus also covering memory writes (rather than
just reads) issued by non-quiescent devices.
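
As a rough mental model (a standalone, simplified sketch with made-up
fan-out and names, not the Xen implementation), the resulting layout
keeps one page-table page per level, with every slot populated and
every leaf slot pointing at a single shared, writable scratch page, so
stray DMA reads and writes all land in that one page:

#include <stdio.h>

#define FANOUT 4                     /* 512 entries per table in a real IOMMU */
#define LEVELS 3

struct node { void *slot[FANOUT]; };

static char scratch[4096];           /* the single shared, writable leaf page */
static struct node table[LEVELS];    /* one shared page-table page per level */

static void build(void)
{
    for ( unsigned int lvl = 0; lvl < LEVELS; ++lvl )
        for ( unsigned int i = 0; i < FANOUT; ++i )
            table[lvl].slot[i] = lvl ? (void *)&table[lvl - 1]
                                     : (void *)scratch;
}

/* Walk an arbitrary index path from the root; every path ends at scratch. */
static void *walk(unsigned int i0, unsigned int i1, unsigned int i2)
{
    struct node *l1 = table[LEVELS - 1].slot[i0];
    struct node *l0 = l1->slot[i1];

    return l0->slot[i2];
}

int main(void)
{
    build();
    printf("both paths hit the scratch page: %d\n",
           walk(0, 1, 2) == scratch && walk(3, 0, 3) == scratch);
    return 0;
}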

Set up quarantine page tables as late as possible, yet early enough to
not encounter failure during de-assign. This means setup generally
happens in assign_device(), while (for now) the one in deassign_device()
is there mainly to be on the safe side.

In VT-d's DID allocation function, don't require the IOMMU lock to be
held anymore: All involved code paths hold pcidevs_lock, so this way we
avoid the need to acquire the IOMMU lock around the new call to
context_set_domain_id().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1453,7 +1453,7 @@ int set_identity_p2m_entry(struct domain
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
     int ret;
 
-    if ( !paging_mode_translate(p2m->domain) )
+    if ( !paging_mode_translate(d) )
     {
         if ( !is_iommu_enabled(d) )
             return 0;
--- a/xen/include/asm-x86/pci.h
+++ b/xen/include/asm-x86/pci.h
@@ -1,6 +1,8 @@
 #ifndef __X86_PCI_H__
 #define __X86_PCI_H__
 
+#include <xen/mm.h>
+
 #define CF8_BDF(cf8)     (  ((cf8) & 0x00ffff00) >> 8)
 #define CF8_ADDR_LO(cf8) (   (cf8) & 0x000000fc)
 #define CF8_ADDR_HI(cf8) (  ((cf8) & 0x0f000000) >> 16)
@@ -20,7 +22,18 @@ struct arch_pci_dev {
      * them don't race (de)initialization and hence don't strictly need any
      * locking.
      */
+    union {
+        /* Subset of struct arch_iommu's fields, to be used in dom_io. */
+        struct {
+            uint64_t pgd_maddr;
+        } vtd;
+        struct {
+            struct page_info *root_table;
+        } amd;
+    };
     domid_t pseudo_domid;
+    mfn_t leaf_mfn;
+    struct page_list_head pgtables_list;
 };
 
 int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
@@ -54,7 +54,8 @@ int amd_iommu_init_late(void);
 int amd_iommu_update_ivrs_mapping_acpi(void);
 int iov_adjust_irq_affinities(void);
 
-int amd_iommu_quarantine_init(struct domain *d);
+int amd_iommu_quarantine_init(struct pci_dev *pdev);
+void amd_iommu_quarantine_teardown(struct pci_dev *pdev);
 
 /* mapping functions */
 int __must_check amd_iommu_map_page(struct domain *d, dfn_t dfn,
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -539,64 +539,137 @@ int amd_iommu_reserve_domain_unity_unmap
     return rc;
 }
 
-int __init amd_iommu_quarantine_init(struct domain *d)
+static int fill_qpt(union amd_iommu_pte *this, unsigned int level,
+                    struct page_info *pgs[IOMMU_MAX_PT_LEVELS],
+                    struct pci_dev *pdev)
 {
-    struct domain_iommu *hd = dom_iommu(d);
+    unsigned int i;
+    int rc = 0;
+
+    for ( i = 0; !rc && i < PTE_PER_TABLE_SIZE; ++i )
+    {
+        union amd_iommu_pte *pte = &this[i], *next;
+
+        if ( !pte->pr )
+        {
+            if ( !pgs[level] )
+            {
+                /*
+                 * The pgtable allocator is fine for the leaf page, as well as
+                 * page table pages, and the resulting allocations are always
+                 * zeroed.
+                 */
+                pgs[level] = alloc_amd_iommu_pgtable();
+                if ( !pgs[level] )
+                {
+                    rc = -ENOMEM;
+                    break;
+                }
+
+                page_list_add(pgs[level], &pdev->arch.pgtables_list);
+
+                if ( level )
+                {
+                    next = __map_domain_page(pgs[level]);
+                    rc = fill_qpt(next, level - 1, pgs, pdev);
+                    unmap_domain_page(next);
+                }
+            }
+
+            /*
+             * PDEs are essentially a subset of PTEs, so this function
+             * is fine to use even at the leaf.
+             */
+            set_iommu_pde_present(pte, mfn_x(page_to_mfn(pgs[level])), level,
+                                  true, true);
+        }
+        else if ( level && pte->next_level )
+        {
+            page_list_add(mfn_to_page(_mfn(pte->mfn)),
+                          &pdev->arch.pgtables_list);
+            next = map_domain_page(_mfn(pte->mfn));
+            rc = fill_qpt(next, level - 1, pgs, pdev);
+            unmap_domain_page(next);
+        }
+    }
+
+    return rc;
+}
+
+int amd_iommu_quarantine_init(struct pci_dev *pdev)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
     unsigned long end_gfn =
         1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT);
     unsigned int level = amd_iommu_get_paging_mode(end_gfn);
-    union amd_iommu_pte *table;
+    unsigned int req_id = get_dma_requestor_id(pdev->seg, pdev->sbdf.bdf);
+    const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
+    int rc;
 
-    if ( hd->arch.root_table )
+    ASSERT(pcidevs_locked());
+    ASSERT(!hd->arch.root_table);
+
+    ASSERT(pdev->arch.pseudo_domid != DOMID_INVALID);
+
+    if ( pdev->arch.amd.root_table )
     {
-        ASSERT_UNREACHABLE();
+        clear_domain_page(pdev->arch.leaf_mfn);
         return 0;
     }
 
-    spin_lock(&hd->arch.mapping_lock);
-
-    hd->arch.root_table = alloc_amd_iommu_pgtable();
-    if ( !hd->arch.root_table )
-        goto out;
-
-    table = __map_domain_page(hd->arch.root_table);
-    while ( level )
+    pdev->arch.amd.root_table = alloc_amd_iommu_pgtable();
+    if ( !pdev->arch.amd.root_table )
+        return -ENOMEM;
+
+    /* Transiently install the root into DomIO, for iommu_identity_mapping(). */
+    hd->arch.root_table = pdev->arch.amd.root_table;
+
+    rc = amd_iommu_reserve_domain_unity_map(dom_io,
+                                            ivrs_mappings[req_id].unity_map,
+                                            0);
+
+    iommu_identity_map_teardown(dom_io);
+    hd->arch.root_table = NULL;
+
+    if ( rc )
+        printk("%04x:%02x:%02x.%u: quarantine unity mapping failed\n",
+               pdev->seg, pdev->bus,
+               PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
+    else
     {
-        struct page_info *pg;
-        unsigned int i;
+        union amd_iommu_pte *root;
+        struct page_info *pgs[IOMMU_MAX_PT_LEVELS] = {};
 
-        /*
-         * The pgtable allocator is fine for the leaf page, as well as
-         * page table pages, and the resulting allocations are always
-         * zeroed.
-         */
-        pg = alloc_amd_iommu_pgtable();
-        if ( !pg )
-            break;
+        spin_lock(&hd->arch.mapping_lock);
 
-        for ( i = 0; i < PTE_PER_TABLE_SIZE; i++ )
-        {
-            union amd_iommu_pte *pde = &table[i];
+        root = __map_domain_page(pdev->arch.amd.root_table);
+        rc = fill_qpt(root, level - 1, pgs, pdev);
+        unmap_domain_page(root);
 
-            /*
-             * PDEs are essentially a subset of PTEs, so this function
-             * is fine to use even at the leaf.
-             */
-            set_iommu_pde_present(pde, mfn_x(page_to_mfn(pg)), level - 1,
-                                  false, true);
-        }
+        pdev->arch.leaf_mfn = page_to_mfn(pgs[0]);
 
-        unmap_domain_page(table);
-        table = __map_domain_page(pg);
-        level--;
+        spin_unlock(&hd->arch.mapping_lock);
     }
-    unmap_domain_page(table);
 
- out:
-    spin_unlock(&hd->arch.mapping_lock);
+    if ( rc )
+        amd_iommu_quarantine_teardown(pdev);
+
+    return rc;
+}
+
+void amd_iommu_quarantine_teardown(struct pci_dev *pdev)
+{
+    struct page_info *pg;
+
+    ASSERT(pcidevs_locked());
+
+    if ( !pdev->arch.amd.root_table )
+        return;
+
+    while ( (pg = page_list_remove_head(&pdev->arch.pgtables_list)) )
+        free_amd_iommu_pgtable(pg);
 
-    /* Pages leaked in failure case */
-    return level ? -ENOMEM : 0;
+    pdev->arch.amd.root_table = NULL;
 }
 
 /*
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -125,6 +125,8 @@ static int __must_check amd_iommu_setup_
     u8 bus = pdev->bus;
     struct domain_iommu *hd = dom_iommu(domain);
     const struct ivrs_mappings *ivrs_dev;
+    const struct page_info *root_pg;
+    domid_t domid;
 
     BUG_ON(!hd->arch.paging_mode || !iommu->dev_table.buffer);
 
@@ -144,14 +146,25 @@ static int __must_check amd_iommu_setup_
     dte = &table[req_id];
     ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
 
+    if ( domain != dom_io )
+    {
+        root_pg = hd->arch.root_table;
+        domid = domain->domain_id;
+    }
+    else
+    {
+        root_pg = pdev->arch.amd.root_table;
+        domid = pdev->arch.pseudo_domid;
+    }
+
     spin_lock_irqsave(&iommu->lock, flags);
 
     if ( !dte->v || !dte->tv )
     {
         /* bind DTE to domain page-tables */
         rc = amd_iommu_set_root_page_table(
-                 dte, page_to_maddr(hd->arch.root_table),
-                 domain->domain_id, hd->arch.paging_mode, sr_flags);
+                 dte, page_to_maddr(root_pg), domid,
+                 hd->arch.paging_mode, sr_flags);
         if ( rc )
         {
             ASSERT(rc < 0);
@@ -175,7 +188,7 @@ static int __must_check amd_iommu_setup_
 
         amd_iommu_flush_device(iommu, req_id);
     }
-    else if ( dte->pt_root != mfn_x(page_to_mfn(hd->arch.root_table)) )
+    else if ( dte->pt_root != mfn_x(page_to_mfn(root_pg)) )
     {
         /*
          * Strictly speaking if the device is the only one with this requestor
@@ -188,8 +201,8 @@ static int __must_check amd_iommu_setup_
             rc = -EOPNOTSUPP;
         else
             rc = amd_iommu_set_root_page_table(
-                     dte, page_to_maddr(hd->arch.root_table),
-                     domain->domain_id, hd->arch.paging_mode, sr_flags);
+                     dte, page_to_maddr(root_pg), domid,
+                     hd->arch.paging_mode, sr_flags);
         if ( rc < 0 )
         {
             spin_unlock_irqrestore(&iommu->lock, flags);
@@ -208,6 +221,7 @@ static int __must_check amd_iommu_setup_
               * intended anyway.
               */
              !pdev->domain->is_dying &&
+             pdev->domain != dom_io &&
              (any_pdev_behind_iommu(pdev->domain, pdev, iommu) ||
               pdev->phantom_stride) )
             printk(" %04x:%02x:%02x.%u: reassignment may cause %pd data corruption\n",
@@ -238,9 +252,8 @@ static int __must_check amd_iommu_setup_
     AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
                     "root table = %#"PRIx64", "
                     "domain = %d, paging mode = %d\n",
-                    req_id, pdev->type,
-                    page_to_maddr(hd->arch.root_table),
-                    domain->domain_id, hd->arch.paging_mode);
+                    req_id, pdev->type, page_to_maddr(root_pg),
+                    domid, hd->arch.paging_mode);
 
     ASSERT(pcidevs_locked());
 
@@ -313,7 +326,7 @@ static int iov_enable_xt(void)
 
 int amd_iommu_alloc_root(struct domain_iommu *hd)
 {
-    if ( unlikely(!hd->arch.root_table) )
+    if ( unlikely(!hd->arch.root_table) && hd != dom_iommu(dom_io) )
     {
         hd->arch.root_table = alloc_amd_iommu_pgtable();
         if ( !hd->arch.root_table )
@@ -404,7 +417,7 @@ static void amd_iommu_disable_domain_dev
 
         AMD_IOMMU_DEBUG("Disable: device id = %#x, "
                         "domain = %d, paging mode = %d\n",
-                        req_id,  domain->domain_id,
+                        req_id, dte->domain_id,
                         dom_iommu(domain)->arch.paging_mode);
     }
     spin_unlock_irqrestore(&iommu->lock, flags);
@@ -668,6 +681,8 @@ static int amd_iommu_remove_device(u8 de
 
     amd_iommu_disable_domain_device(pdev->domain, iommu, devfn, pdev);
 
+    amd_iommu_quarantine_teardown(pdev);
+
     iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
     pdev->arch.pseudo_domid = DOMID_INVALID;
 
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -450,21 +450,21 @@ int iommu_iotlb_flush_all(struct domain
     return rc;
 }
 
-static int __init iommu_quarantine_init(void)
+int iommu_quarantine_dev_init(device_t *dev)
 {
     const struct domain_iommu *hd = dom_iommu(dom_io);
-    int rc;
 
-    dom_io->options |= XEN_DOMCTL_CDF_iommu;
+    if ( !iommu_quarantine || !hd->platform_ops->quarantine_init )
+        return 0;
 
-    rc = iommu_domain_init(dom_io, 0);
-    if ( rc )
-        return rc;
+    return iommu_call(hd->platform_ops, quarantine_init, dev);
+}
 
-    if ( !hd->platform_ops->quarantine_init )
-        return 0;
+static int __init iommu_quarantine_init(void)
+{
+    dom_io->options |= XEN_DOMCTL_CDF_iommu;
 
-    return hd->platform_ops->quarantine_init(dom_io);
+    return iommu_domain_init(dom_io, 0);
 }
 
 int __init iommu_setup(void)
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -929,9 +929,16 @@ static int deassign_device(struct domain
         return -ENODEV;
 
     /* De-assignment from dom_io should de-quarantine the device */
-    target = ((pdev->quarantine || iommu_quarantine) &&
-              pdev->domain != dom_io) ?
-        dom_io : hardware_domain;
+    if ( (pdev->quarantine || iommu_quarantine) && pdev->domain != dom_io )
+    {
+        ret = iommu_quarantine_dev_init(pci_to_dev(pdev));
+        if ( ret )
+           return ret;
+
+        target = dom_io;
+    }
+    else
+        target = hardware_domain;
 
     while ( pdev->phantom_stride )
     {
@@ -1547,6 +1554,13 @@ static int assign_device(struct domain *
         msixtbl_init(d);
     }
 
+    if ( pdev->domain != dom_io )
+    {
+        rc = iommu_quarantine_dev_init(pci_to_dev(pdev));
+        if ( rc )
+            goto done;
+    }
+
     pdev->fault.count = 0;
 
     if ( (rc = hd->platform_ops->assign_device(d, devfn, pci_to_dev(pdev), flag)) )
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -43,6 +43,12 @@
 #include "vtd.h"
 #include "../ats.h"
 
+#define DEVICE_DOMID(d, pdev) ((d) != dom_io ? (d)->domain_id \
+                                             : (pdev)->arch.pseudo_domid)
+#define DEVICE_PGTABLE(d, pdev) ((d) != dom_io \
+                                 ? dom_iommu(d)->arch.pgd_maddr \
+                                 : (pdev)->arch.vtd.pgd_maddr)
+
 /* Possible unfiltered LAPIC/MSI messages from untrusted sources? */
 bool __read_mostly untrusted_msi;
 
@@ -78,13 +84,18 @@ static int get_iommu_did(domid_t domid,
 
 #define DID_FIELD_WIDTH 16
 #define DID_HIGH_OFFSET 8
+
+/*
+ * This function may have "context" passed as NULL, to merely obtain a DID
+ * for "domid".
+ */
 static int context_set_domain_id(struct context_entry *context,
                                  domid_t domid, struct vtd_iommu *iommu)
 {
     unsigned long nr_dom, i;
     int found = 0;
 
-    ASSERT(spin_is_locked(&iommu->lock));
+    ASSERT(pcidevs_locked());
 
     nr_dom = cap_ndoms(iommu->cap);
     i = find_first_bit(iommu->domid_bitmap, nr_dom);
@@ -110,8 +121,13 @@ static int context_set_domain_id(struct
     }
 
     set_bit(i, iommu->domid_bitmap);
-    context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
-    context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
+
+    if ( context )
+    {
+        context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
+        context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
+    }
+
     return 0;
 }
 
@@ -161,8 +177,12 @@ static void check_cleanup_domid_map(stru
                                     const struct pci_dev *exclude,
                                     struct vtd_iommu *iommu)
 {
-    bool found = any_pdev_behind_iommu(d, exclude, iommu);
+    bool found;
+
+    if ( d == dom_io )
+        return;
 
+    found = any_pdev_behind_iommu(d, exclude, iommu);
     /*
      * Hidden devices are associated with DomXEN but usable by the hardware
      * domain. Hence they need considering here as well.
@@ -1400,7 +1420,7 @@ int domain_context_mapping_one(
         domid = iommu->domid_map[prev_did];
         if ( domid < DOMID_FIRST_RESERVED )
             prev_dom = rcu_lock_domain_by_id(domid);
-        else if ( domid == DOMID_IO )
+        else if ( pdev ? domid == pdev->arch.pseudo_domid : domid > DOMID_MASK )
             prev_dom = rcu_lock_domain(dom_io);
         if ( !prev_dom )
         {
@@ -1577,15 +1597,12 @@ int domain_context_mapping_one(
     {
         if ( !prev_dom )
             domain_context_unmap_one(domain, iommu, bus, devfn,
-                                     domain->domain_id);
+                                     DEVICE_DOMID(domain, pdev));
         else if ( prev_dom != domain ) /* Avoid infinite recursion. */
-        {
-            hd = dom_iommu(prev_dom);
             domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
-                                       domain->domain_id,
-                                       hd->arch.pgd_maddr,
+                                       DEVICE_DOMID(prev_dom, pdev),
+                                       DEVICE_PGTABLE(prev_dom, pdev),
                                        mode & MAP_WITH_RMRR);
-        }
     }
 
     if ( prev_dom )
@@ -1602,7 +1619,7 @@ static int domain_context_mapping(struct
 {
     struct acpi_drhd_unit *drhd;
     const struct acpi_rmrr_unit *rmrr;
-    paddr_t pgd_maddr = dom_iommu(domain)->arch.pgd_maddr;
+    paddr_t pgd_maddr = DEVICE_PGTABLE(domain, pdev);
     domid_t orig_domid = pdev->arch.pseudo_domid;
     int ret = 0;
     unsigned int i, mode = 0;
@@ -1635,7 +1652,7 @@ static int domain_context_mapping(struct
         break;
     }
 
-    if ( domain != pdev->domain )
+    if ( domain != pdev->domain && pdev->domain != dom_io )
     {
         if ( pdev->domain->is_dying )
             mode |= MAP_OWNER_DYING;
@@ -1676,8 +1693,8 @@ static int domain_context_mapping(struct
             printk(VTDPREFIX "d%d:PCIe: map %04x:%02x:%02x.%u\n",
                    domain->domain_id, seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
-        ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, domain->domain_id, pgd_maddr,
+        ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, pdev,
+                                         DEVICE_DOMID(domain, pdev), pgd_maddr,
                                          mode);
         if ( ret > 0 )
             ret = 0;
@@ -1701,8 +1718,8 @@ static int domain_context_mapping(struct
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, domain->domain_id, pgd_maddr,
-                                         mode);
+                                         pdev, DEVICE_DOMID(domain, pdev),
+                                         pgd_maddr, mode);
         if ( ret < 0 )
             break;
         prev_present = ret;
@@ -1730,8 +1747,8 @@ static int domain_context_mapping(struct
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                             NULL, domain->domain_id, pgd_maddr,
-                                             mode);
+                                             NULL, DEVICE_DOMID(domain, pdev),
+                                             pgd_maddr, mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1746,8 +1763,8 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL, domain->domain_id, pgd_maddr,
-                                             mode);
+                                             NULL, DEVICE_DOMID(domain, pdev),
+                                             pgd_maddr, mode);
 
         if ( ret )
         {
@@ -1896,7 +1913,7 @@ static const struct acpi_drhd_unit *doma
                    domain->domain_id, seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
         ret = domain_context_unmap_one(domain, iommu, bus, devfn,
-                                       domain->domain_id);
+                                       DEVICE_DOMID(domain, pdev));
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             disable_ats_device(pdev);
 
@@ -1907,7 +1924,7 @@ static const struct acpi_drhd_unit *doma
             printk(VTDPREFIX "d%d:PCI: unmap %04x:%02x:%02x.%u\n",
                    domain->domain_id, seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
         ret = domain_context_unmap_one(domain, iommu, bus, devfn,
-                                       domain->domain_id);
+                                       DEVICE_DOMID(domain, pdev));
         if ( ret )
             break;
 
@@ -1930,18 +1947,12 @@ static const struct acpi_drhd_unit *doma
             break;
         }
 
+        ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                       DEVICE_DOMID(domain, pdev));
         /* PCIe to PCI/PCIx bridge */
-        if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
-        {
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
-                                           domain->domain_id);
-            if ( !ret )
-                ret = domain_context_unmap_one(domain, iommu, secbus, 0,
-                                               domain->domain_id);
-        }
-        else /* Legacy PCI bridge */
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
-                                           domain->domain_id);
+        if ( !ret && pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
+            ret = domain_context_unmap_one(domain, iommu, secbus, 0,
+                                           DEVICE_DOMID(domain, pdev));
 
         break;
 
@@ -1987,6 +1998,25 @@ static void iommu_domain_teardown(struct
         cleanup_domid_map(d->domain_id, drhd->iommu);
 }
 
+static void quarantine_teardown(struct pci_dev *pdev,
+                                const struct acpi_drhd_unit *drhd)
+{
+    struct page_info *pg;
+
+    ASSERT(pcidevs_locked());
+
+    if ( !pdev->arch.vtd.pgd_maddr )
+        return;
+
+    while ( (pg = page_list_remove_head(&pdev->arch.pgtables_list)) )
+        free_domheap_page(pg);
+
+    pdev->arch.vtd.pgd_maddr = 0;
+
+    if ( drhd )
+        cleanup_domid_map(pdev->arch.pseudo_domid, drhd->iommu);
+}
+
 static int __must_check intel_iommu_map_page(struct domain *d, dfn_t dfn,
                                              mfn_t mfn, unsigned int flags,
                                              unsigned int *flush_flags)
@@ -2209,6 +2239,8 @@ static int intel_iommu_remove_device(u8
                                rmrr->end_address, 0);
     }
 
+    quarantine_teardown(pdev, drhd);
+
     if ( drhd )
     {
         iommu_free_domid(pdev->arch.pseudo_domid,
@@ -2888,60 +2920,139 @@ static void vtd_dump_p2m_table(struct do
     vtd_dump_p2m_table_level(hd->arch.pgd_maddr, agaw_to_level(hd->arch.agaw), 0, 0);
 }
 
-static int __init intel_iommu_quarantine_init(struct domain *d)
+static int fill_qpt(struct dma_pte *this, unsigned int level,
+                    paddr_t maddrs[6], struct pci_dev *pdev)
 {
-    struct domain_iommu *hd = dom_iommu(d);
-    struct dma_pte *parent;
+    struct domain_iommu *hd = dom_iommu(dom_io);
+    unsigned int i;
+    int rc = 0;
+
+    for ( i = 0; !rc && i < PTE_NUM; ++i )
+    {
+        struct dma_pte *pte = &this[i], *next;
+
+        if ( !dma_pte_present(*pte) )
+        {
+            if ( !maddrs[level] )
+            {
+                /*
+                 * The pgtable allocator is fine for the leaf page, as well as
+                 * page table pages, and the resulting allocations are always
+                 * zeroed.
+                 */
+                maddrs[level] = alloc_pgtable_maddr(1, hd->node);
+                if ( !maddrs[level] )
+                {
+                    rc = -ENOMEM;
+                    break;
+                }
+
+                page_list_add(maddr_to_page(maddrs[level]),
+                              &pdev->arch.pgtables_list);
+
+                if ( level )
+                {
+                    next = map_vtd_domain_page(maddrs[level]);
+                    rc = fill_qpt(next, level - 1, maddrs, pdev);
+                    unmap_vtd_domain_page(next);
+                }
+            }
+
+            dma_set_pte_addr(*pte, maddrs[level]);
+            dma_set_pte_readable(*pte);
+            dma_set_pte_writable(*pte);
+        }
+        else if ( level && !dma_pte_superpage(*pte) )
+        {
+            page_list_add(maddr_to_page(dma_pte_addr(*pte)),
+                          &pdev->arch.pgtables_list);
+            next = map_vtd_domain_page(dma_pte_addr(*pte));
+            rc = fill_qpt(next, level - 1, maddrs, pdev);
+            unmap_vtd_domain_page(next);
+        }
+    }
+
+    return rc;
+}
+
+static int intel_iommu_quarantine_init(struct pci_dev *pdev)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
+    paddr_t maddr;
     unsigned int agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
     unsigned int level = agaw_to_level(agaw);
+    const struct acpi_drhd_unit *drhd;
+    const struct acpi_rmrr_unit *rmrr;
+    unsigned int i, bdf;
+    bool rmrr_found = false;
+    int rc;
 
-    if ( hd->arch.pgd_maddr )
+    ASSERT(pcidevs_locked());
+    ASSERT(!hd->arch.pgd_maddr);
+
+    if ( pdev->arch.vtd.pgd_maddr )
     {
-        ASSERT_UNREACHABLE();
+        clear_domain_page(pdev->arch.leaf_mfn);
         return 0;
     }
 
-    spin_lock(&hd->arch.mapping_lock);
+    drhd = acpi_find_matched_drhd_unit(pdev);
+    if ( !drhd )
+        return -ENODEV;
 
-    hd->arch.pgd_maddr = alloc_pgtable_maddr(1, hd->node);
-    if ( !hd->arch.pgd_maddr )
-        goto out;
+    maddr = alloc_pgtable_maddr(1, hd->node);
+    if ( !maddr )
+        return -ENOMEM;
 
-    parent = map_vtd_domain_page(hd->arch.pgd_maddr);
-    while ( level )
-    {
-        uint64_t maddr;
-        unsigned int offset;
+    rc = context_set_domain_id(NULL, pdev->arch.pseudo_domid, drhd->iommu);
 
-        /*
-         * The pgtable allocator is fine for the leaf page, as well as
-         * page table pages, and the resulting allocations are always
-         * zeroed.
-         */
-        maddr = alloc_pgtable_maddr(1, hd->node);
-        if ( !maddr )
+    /* Transiently install the root into DomIO, for iommu_identity_mapping(). */
+    hd->arch.pgd_maddr = maddr;
+
+    for_each_rmrr_device ( rmrr, bdf, i )
+    {
+        if ( rc )
             break;
 
-        for ( offset = 0; offset < PTE_NUM; offset++ )
+        if ( rmrr->segment == pdev->seg && bdf == pdev->sbdf.bdf )
         {
-            struct dma_pte *pte = &parent[offset];
+            rmrr_found = true;
 
-            dma_set_pte_addr(*pte, maddr);
-            dma_set_pte_readable(*pte);
+            rc = iommu_identity_mapping(dom_io, p2m_access_rw,
+                                        rmrr->base_address, rmrr->end_address,
+                                        0);
+            if ( rc )
+                printk(XENLOG_ERR VTDPREFIX
+                       "%04x:%02x:%02x.%u: RMRR quarantine mapping failed\n",
+                       pdev->seg, pdev->bus,
+                       PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
         }
-        iommu_sync_cache(parent, PAGE_SIZE);
+    }
 
-        unmap_vtd_domain_page(parent);
-        parent = map_vtd_domain_page(maddr);
-        level--;
+    iommu_identity_map_teardown(dom_io);
+    hd->arch.pgd_maddr = 0;
+    pdev->arch.vtd.pgd_maddr = maddr;
+
+    if ( !rc )
+    {
+        struct dma_pte *root;
+        paddr_t maddrs[6] = {};
+
+        spin_lock(&hd->arch.mapping_lock);
+
+        root = map_vtd_domain_page(maddr);
+        rc = fill_qpt(root, level - 1, maddrs, pdev);
+        unmap_vtd_domain_page(root);
+
+        pdev->arch.leaf_mfn = maddr_to_mfn(maddrs[0]);
+
+        spin_unlock(&hd->arch.mapping_lock);
     }
-    unmap_vtd_domain_page(parent);
 
- out:
-    spin_unlock(&hd->arch.mapping_lock);
+    if ( rc )
+        quarantine_teardown(pdev, drhd);
 
-    /* Pages leaked in failure case */
-    return level ? -ENOMEM : 0;
+    return rc;
 }
 
 const struct iommu_ops __initconstrel intel_iommu_ops = {
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -509,7 +509,7 @@ struct vtd_iommu {
     u32 nr_pt_levels;
     u64	cap;
     u64	ecap;
-    spinlock_t lock; /* protect context, domain ids */
+    spinlock_t lock; /* protect context */
     spinlock_t register_lock; /* protect iommu register handling */
     u64 root_maddr; /* root entry machine address */
     nodeid_t node;
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -211,7 +211,7 @@ typedef int iommu_grdm_t(xen_pfn_t start
 struct iommu_ops {
     int (*init)(struct domain *d);
     void (*hwdom_init)(struct domain *d);
-    int (*quarantine_init)(struct domain *d);
+    int (*quarantine_init)(device_t *dev);
     int (*add_device)(u8 devfn, device_t *dev);
     int (*enable_device)(device_t *dev);
     int (*remove_device)(u8 devfn, device_t *dev);
@@ -331,6 +331,7 @@ int __must_check iommu_suspend(void);
 void iommu_resume(void);
 void iommu_crash_shutdown(void);
 int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
+int iommu_quarantine_dev_init(device_t *dev);
 
 void iommu_share_p2m_table(struct domain *d);
 
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: fix (de)assign ordering when RMRRs are in use

In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry has
been successfully updated.

Also adjust a related log message.

This is CVE-2022-26358 / part of XSA-400.

Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
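
As an illustration of the ordering described above (not part of the patch;
the helper names are hypothetical stand-ins rather than the functions
touched by the diff below), a minimal self-contained C sketch:

    #include <stdio.h>

    /* Hypothetical stand-ins for the real operations. */
    static int map_rmrr_regions(void)     { puts("map RMRRs"); return 0; }
    static void unmap_rmrr_regions(void)  { puts("unmap RMRRs"); }
    static int update_context_entry(void) { puts("update context entry"); return 0; }

    /* Assignment: RMRR mappings first, context entry second; roll back on failure. */
    static int assign(void)
    {
        int rc = map_rmrr_regions();

        if ( rc )
            return rc;

        rc = update_context_entry();
        if ( rc )
            unmap_rmrr_regions();

        return rc;
    }

    /* Removal: update/clear the context entry first, tear down RMRRs only afterwards. */
    static void deassign(void)
    {
        puts("clear context entry");
        unmap_rmrr_regions();
    }

    int main(void)
    {
        if ( !assign() )
            deassign();

        return 0;
    }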

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2411,6 +2411,10 @@ static int reassign_device_ownership(
 {
     int ret;
 
+    ret = domain_context_unmap(source, devfn, pdev);
+    if ( ret )
+        return ret;
+
     /*
      * Devices assigned to untrusted domains (here assumed to be any domU)
      * can attempt to send arbitrary LAPIC/MSI messages. We are unprotected
@@ -2447,10 +2451,6 @@ static int reassign_device_ownership(
             }
     }
 
-    ret = domain_context_unmap(source, devfn, pdev);
-    if ( ret )
-        return ret;
-
     if ( devfn == pdev->devfn && pdev->domain != dom_io )
     {
         list_move(&pdev->domain_list, &dom_io->pdev_list);
@@ -2527,9 +2527,8 @@ static int intel_iommu_assign_device(
         }
     }
 
-    ret = reassign_device_ownership(s, d, devfn, pdev);
-    if ( ret || d == dom_io )
-        return ret;
+    if ( d == dom_io )
+        return reassign_device_ownership(s, d, devfn, pdev);
 
     /* Setup rmrr identity mapping */
     for_each_rmrr_device( rmrr, bdf, i )
@@ -2542,20 +2541,37 @@ static int intel_iommu_assign_device(
                                          rmrr->end_address, flag);
             if ( ret )
             {
-                int rc;
-
-                rc = reassign_device_ownership(d, s, devfn, pdev);
                 printk(XENLOG_G_ERR VTDPREFIX
-                       " cannot map reserved region (%"PRIx64",%"PRIx64"] for Dom%d (%d)\n",
-                       rmrr->base_address, rmrr->end_address,
-                       d->domain_id, ret);
-                if ( rc )
-                {
-                    printk(XENLOG_ERR VTDPREFIX
-                           " failed to reclaim %04x:%02x:%02x.%u from %pd (%d)\n",
-                           seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), d, rc);
-                    domain_crash(d);
-                }
+                       "%pd: cannot map reserved region [%"PRIx64",%"PRIx64"]: %d\n",
+                       d, rmrr->base_address, rmrr->end_address, ret);
+                break;
+            }
+        }
+    }
+
+    if ( !ret )
+        ret = reassign_device_ownership(s, d, devfn, pdev);
+
+    /* See reassign_device_ownership() for the hwdom aspect. */
+    if ( !ret || is_hardware_domain(d) )
+        return ret;
+
+    for_each_rmrr_device( rmrr, bdf, i )
+    {
+        if ( rmrr->segment == seg &&
+             PCI_BUS(bdf) == bus &&
+             PCI_DEVFN2(bdf) == devfn )
+        {
+            int rc = iommu_identity_mapping(d, p2m_access_x,
+                                            rmrr->base_address,
+                                            rmrr->end_address, 0);
+
+            if ( rc && rc != -ENOENT )
+            {
+                printk(XENLOG_ERR VTDPREFIX
+                       "%pd: cannot unmap reserved region [%"PRIx64",%"PRIx64"]: %d\n",
+                       d, rmrr->base_address, rmrr->end_address, rc);
+                domain_crash(d);
                 break;
             }
         }
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: fix add/remove ordering when RMRRs are in use

In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry has
been successfully cleared.

Also switch to %pd in related log messages.

Fixes: fa88cfadf918 ("vt-d: Map RMRR in intel_iommu_add_device() if the device has RMRR")
Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2000,14 +2000,6 @@ static int intel_iommu_add_device(u8 dev
     if ( !pdev->domain )
         return -EINVAL;
 
-    ret = domain_context_mapping(pdev->domain, devfn, pdev);
-    if ( ret )
-    {
-        dprintk(XENLOG_ERR VTDPREFIX, "d%d: context mapping failed\n",
-                pdev->domain->domain_id);
-        return ret;
-    }
-
     for_each_rmrr_device ( rmrr, bdf, i )
     {
         if ( rmrr->segment == pdev->seg &&
@@ -2024,12 +2016,17 @@ static int intel_iommu_add_device(u8 dev
                                          rmrr->base_address, rmrr->end_address,
                                          0);
             if ( ret )
-                dprintk(XENLOG_ERR VTDPREFIX, "d%d: RMRR mapping failed\n",
-                        pdev->domain->domain_id);
+                dprintk(XENLOG_ERR VTDPREFIX, "%pd: RMRR mapping failed\n",
+                        pdev->domain);
         }
     }
 
-    return 0;
+    ret = domain_context_mapping(pdev->domain, devfn, pdev);
+    if ( ret )
+        dprintk(XENLOG_ERR VTDPREFIX, "%pd: context mapping failed\n",
+                pdev->domain);
+
+    return ret;
 }
 
 static int intel_iommu_enable_device(struct pci_dev *pdev)
@@ -2051,11 +2048,15 @@ static int intel_iommu_remove_device(u8
 {
     struct acpi_rmrr_unit *rmrr;
     u16 bdf;
-    int i;
+    int ret, i;
 
     if ( !pdev->domain )
         return -EINVAL;
 
+    ret = domain_context_unmap(pdev->domain, devfn, pdev);
+    if ( ret )
+        return ret;
+
     for_each_rmrr_device ( rmrr, bdf, i )
     {
         if ( rmrr->segment != pdev->seg ||
@@ -2071,7 +2072,7 @@ static int intel_iommu_remove_device(u8
                                rmrr->end_address, 0);
     }
 
-    return domain_context_unmap(pdev->domain, devfn, pdev);
+    return 0;
 }
 
 static int __hwdom_init setup_hwdom_device(u8 devfn, struct pci_dev *pdev)
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: drop ownership checking from domain_context_mapping_one()

Despite putting in quite a bit of effort, it was not possible to
establish why exactly this code exists (beyond possibly sanity
checking). Instead of a subsequent change further complicating this
logic, simply get rid of it.

Take the opportunity and move the respective unmap_vtd_domain_page() out
of the locked region.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -120,28 +120,6 @@ static int context_set_domain_id(struct
     return 0;
 }
 
-static int context_get_domain_id(struct context_entry *context,
-                                 struct vtd_iommu *iommu)
-{
-    unsigned long dom_index, nr_dom;
-    int domid = -1;
-
-    if (iommu && context)
-    {
-        nr_dom = cap_ndoms(iommu->cap);
-
-        dom_index = context_domain_id(*context);
-
-        if ( dom_index < nr_dom && iommu->domid_map )
-            domid = iommu->domid_map[dom_index];
-        else
-            dprintk(XENLOG_DEBUG VTDPREFIX,
-                    "dom_index %lu exceeds nr_dom %lu or iommu has no domid_map\n",
-                    dom_index, nr_dom);
-    }
-    return domid;
-}
-
 static void cleanup_domid_map(struct domain *domain, struct vtd_iommu *iommu)
 {
     int iommu_domid = domain_iommu_domid(domain, iommu);
@@ -1395,47 +1373,9 @@ int domain_context_mapping_one(
 
     if ( context_present(*context) )
     {
-        int res = 0;
-
-        /* Try to get domain ownership from device structure.  If that's
-         * not available, try to read it from the context itself. */
-        if ( pdev )
-        {
-            if ( pdev->domain != domain )
-            {
-                printk(XENLOG_G_INFO VTDPREFIX
-                       "%pd: %04x:%02x:%02x.%u owned by %pd\n",
-                       domain, seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
-                       pdev->domain);
-                res = -EINVAL;
-            }
-        }
-        else
-        {
-            int cdomain;
-            cdomain = context_get_domain_id(context, iommu);
-            
-            if ( cdomain < 0 )
-            {
-                printk(XENLOG_G_WARNING VTDPREFIX
-                       "%pd: %04x:%02x:%02x.%u mapped, but can't find owner\n",
-                       domain, seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
-                res = -EINVAL;
-            }
-            else if ( cdomain != domain->domain_id )
-            {
-                printk(XENLOG_G_INFO VTDPREFIX
-                       "%pd: %04x:%02x:%02x.%u already mapped to d%d\n",
-                       domain,
-                       seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
-                       cdomain);
-                res = -EINVAL;
-            }
-        }
-
-        unmap_vtd_domain_page(context_entries);
         spin_unlock(&iommu->lock);
-        return res;
+        unmap_vtd_domain_page(context_entries);
+        return 0;
     }
 
     if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: re-assign devices directly

Devices with RMRRs, due to it being unspecified how/when the specified
memory regions may get accessed, must not be left disconnected from their
respective mappings (as long as it's not certain that the device has
been fully quiesced). Hence rather than unmapping the old context and
then mapping the new one, re-assignment needs to be done in a single
step.

This is CVE-2022-26359 / part of XSA-400.

Reported-by: Roger Pau Monné <roger.pau@citrix.com>

Similarly, the scratch-page mode of quarantining relies on page tables being
continuously wired up.

To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any RMRRs. The
main difference is when it comes to updating context entries, which need
to be atomic when there are RMRRs. Yet atomicity can only be achieved
with CMPXCHG16B, the availability of which we can't take for granted.

The seemingly complicated choice of non-negative return values for
domain_context_mapping_one() is to limit code churn: this way, call sites
passing NULL for pdev don't need to be touched.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
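
To make the atomicity point above concrete (illustration only, not part of
the patch): a 16-byte entry can either be replaced in one step with a
128-bit compare-and-exchange, or, where CMPXCHG16B is unavailable, be
transiently cleared and then rewritten half by half.  The sketch below uses
the GCC/Clang __atomic builtins and a made-up entry layout; on x86-64 build
with -mcx16 (or link libatomic):

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct entry {
        union {
            struct { uint64_t lo, hi; }; /* pretend "present" lives in lo bit 0 */
            unsigned __int128 full;
        };
    };

    /* Preferred path: replace both halves in a single atomic operation. */
    static bool update_atomic(struct entry *e, struct entry new)
    {
        unsigned __int128 old = e->full;

        return __atomic_compare_exchange_n(&e->full, &old, new.full, false,
                                           __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }

    /* Fallback: transiently mark the entry non-present, write hi, then lo. */
    static void update_split(struct entry *e, struct entry new)
    {
        __atomic_store_n(&e->lo, 0, __ATOMIC_SEQ_CST);       /* not present */
        __atomic_store_n(&e->hi, new.hi, __ATOMIC_SEQ_CST);
        __atomic_store_n(&e->lo, new.lo, __ATOMIC_SEQ_CST);   /* goes live last */
    }

    int main(void)
    {
        struct entry e = { .lo = 0x1, .hi = 0x10 };
        struct entry n = { .lo = 0x3, .hi = 0x30 };

        if ( !update_atomic(&e, n) )
            update_split(&e, n);

        printf("lo=%#llx hi=%#llx\n",
               (unsigned long long)e.lo, (unsigned long long)e.hi);

        return 0;
    }

The transient-clear fallback briefly leaves the entry non-present, which is
acceptable only when no RMRR depends on it; that is the distinction the
patch draws when choosing between the update strategies.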

--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -85,7 +85,8 @@ void free_pgtable_maddr(u64 maddr);
 void *map_vtd_domain_page(u64 maddr);
 void unmap_vtd_domain_page(void *va);
 int domain_context_mapping_one(struct domain *domain, struct vtd_iommu *iommu,
-                               u8 bus, u8 devfn, const struct pci_dev *);
+                               uint8_t bus, uint8_t devfn,
+                               const struct pci_dev *pdev, unsigned int mode);
 int domain_context_unmap_one(struct domain *domain, struct vtd_iommu *iommu,
                              u8 bus, u8 devfn);
 int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
@@ -105,8 +106,8 @@ int is_igd_vt_enabled_quirk(void);
 void platform_quirks_init(void);
 void vtd_ops_preamble_quirk(struct vtd_iommu *iommu);
 void vtd_ops_postamble_quirk(struct vtd_iommu *iommu);
-int __must_check me_wifi_quirk(struct domain *domain,
-                               u8 bus, u8 devfn, int map);
+int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus,
+                               uint8_t devfn, unsigned int mode);
 void pci_vtd_quirk(const struct pci_dev *);
 void quirk_iommu_caps(struct vtd_iommu *iommu);
 
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -116,6 +116,7 @@ static int context_set_domain_id(struct
     }
 
     set_bit(i, iommu->domid_bitmap);
+    context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
     context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
     return 0;
 }
@@ -1353,15 +1354,27 @@ static void __hwdom_init intel_iommu_hwd
     }
 }
 
+/*
+ * This function returns
+ * - a negative errno value upon error,
+ * - zero upon success when previously the entry was non-present, or this isn't
+ *   the "main" request for a device (pdev == NULL), or for no-op quarantining
+ *   assignments,
+ * - positive (one) upon success when previously the entry was present and this
+ *   is the "main" request for a device (pdev != NULL).
+ */
 int domain_context_mapping_one(
     struct domain *domain,
     struct vtd_iommu *iommu,
-    u8 bus, u8 devfn, const struct pci_dev *pdev)
+    uint8_t bus, uint8_t devfn, const struct pci_dev *pdev,
+    unsigned int mode)
 {
     struct domain_iommu *hd = dom_iommu(domain);
-    struct context_entry *context, *context_entries;
+    struct context_entry *context, *context_entries, lctxt;
+    __uint128_t old;
     u64 maddr, pgd_maddr;
-    u16 seg = iommu->drhd->segment;
+    uint16_t seg = iommu->drhd->segment, prev_did = 0;
+    struct domain *prev_dom = NULL;
     int agaw, rc, ret;
     bool_t flush_dev_iotlb;
 
@@ -1370,17 +1383,32 @@ int domain_context_mapping_one(
     maddr = bus_to_context_maddr(iommu, bus);
     context_entries = (struct context_entry *)map_vtd_domain_page(maddr);
     context = &context_entries[devfn];
+    old = (lctxt = *context).full;
 
-    if ( context_present(*context) )
+    if ( context_present(lctxt) )
     {
-        spin_unlock(&iommu->lock);
-        unmap_vtd_domain_page(context_entries);
-        return 0;
+        domid_t domid;
+
+        prev_did = context_domain_id(lctxt);
+        domid = iommu->domid_map[prev_did];
+        if ( domid < DOMID_FIRST_RESERVED )
+            prev_dom = rcu_lock_domain_by_id(domid);
+        else if ( domid == DOMID_IO )
+            prev_dom = rcu_lock_domain(dom_io);
+        if ( !prev_dom )
+        {
+            spin_unlock(&iommu->lock);
+            unmap_vtd_domain_page(context_entries);
+            dprintk(XENLOG_DEBUG VTDPREFIX,
+                    "no domain for did %u (nr_dom %u)\n",
+                    prev_did, cap_ndoms(iommu->cap));
+            return -ESRCH;
+        }
     }
 
     if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
     {
-        context_set_translation_type(*context, CONTEXT_TT_PASS_THRU);
+        context_set_translation_type(lctxt, CONTEXT_TT_PASS_THRU);
         agaw = level_to_agaw(iommu->nr_pt_levels);
     }
     else
@@ -1397,6 +1425,8 @@ int domain_context_mapping_one(
                 spin_unlock(&hd->arch.mapping_lock);
                 spin_unlock(&iommu->lock);
                 unmap_vtd_domain_page(context_entries);
+                if ( prev_dom )
+                    rcu_unlock_domain(prev_dom);
                 return -ENOMEM;
             }
         }
@@ -1414,33 +1444,102 @@ int domain_context_mapping_one(
                 goto nomem;
         }
 
-        context_set_address_root(*context, pgd_maddr);
+        context_set_address_root(lctxt, pgd_maddr);
         if ( ats_enabled && ecap_dev_iotlb(iommu->ecap) )
-            context_set_translation_type(*context, CONTEXT_TT_DEV_IOTLB);
+            context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB);
         else
-            context_set_translation_type(*context, CONTEXT_TT_MULTI_LEVEL);
+            context_set_translation_type(lctxt, CONTEXT_TT_MULTI_LEVEL);
 
         spin_unlock(&hd->arch.mapping_lock);
     }
 
-    if ( context_set_domain_id(context, domain, iommu) )
+    rc = context_set_domain_id(&lctxt, domain, iommu);
+    if ( rc )
     {
+    unlock:
         spin_unlock(&iommu->lock);
         unmap_vtd_domain_page(context_entries);
-        return -EFAULT;
+        if ( prev_dom )
+            rcu_unlock_domain(prev_dom);
+        return rc;
+    }
+
+    if ( !prev_dom )
+    {
+        context_set_address_width(lctxt, agaw);
+        context_set_fault_enable(lctxt);
+        context_set_present(lctxt);
+    }
+    else if ( prev_dom == domain )
+    {
+        ASSERT(lctxt.full == context->full);
+        rc = !!pdev;
+        goto unlock;
+    }
+    else
+    {
+        ASSERT(context_address_width(lctxt) == agaw);
+        ASSERT(!context_fault_disable(lctxt));
+    }
+
+    if ( cpu_has_cx16 )
+    {
+        __uint128_t res = cmpxchg16b(context, &old, &lctxt.full);
+
+        /*
+         * Hardware does not update the context entry behind our backs,
+         * so the return value should match "old".
+         */
+        if ( res != old )
+        {
+            if ( pdev )
+                check_cleanup_domid_map(domain, pdev, iommu);
+            printk(XENLOG_ERR
+                   "%04x:%02x:%02x.%u: unexpected context entry %016lx_%016lx (expected %016lx_%016lx)\n",
+                   pdev->seg, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                   (uint64_t)(res >> 64), (uint64_t)res,
+                   (uint64_t)(old >> 64), (uint64_t)old);
+            rc = -EILSEQ;
+            goto unlock;
+        }
+    }
+    else if ( !prev_dom || !(mode & MAP_WITH_RMRR) )
+    {
+        context_clear_present(*context);
+        iommu_sync_cache(context, sizeof(*context));
+
+        write_atomic(&context->hi, lctxt.hi);
+        /* No barrier should be needed between these two. */
+        write_atomic(&context->lo, lctxt.lo);
+    }
+    else /* Best effort, updating DID last. */
+    {
+         /*
+          * By non-atomically updating the context entry's DID field last,
+          * during a short window in time TLB entries with the old domain ID
+          * but the new page tables may be inserted.  This could affect I/O
+          * of other devices using this same (old) domain ID.  Such updating
+          * therefore is not a problem if this was the only device associated
+          * with the old domain ID.  Diverting I/O of any of a dying domain's
+          * devices to the quarantine page tables is intended anyway.
+          */
+        if ( !(mode & (MAP_OWNER_DYING | MAP_SINGLE_DEVICE)) )
+            printk(XENLOG_WARNING VTDPREFIX
+                   " %04x:%02x:%02x.%u: reassignment may cause %pd data corruption\n",
+                   seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), prev_dom);
+
+        write_atomic(&context->lo, lctxt.lo);
+        /* No barrier should be needed between these two. */
+        write_atomic(&context->hi, lctxt.hi);
     }
 
-    context_set_address_width(*context, agaw);
-    context_set_fault_enable(*context);
-    context_set_present(*context);
     iommu_sync_cache(context, sizeof(struct context_entry));
     spin_unlock(&iommu->lock);
 
-    /* Context entry was previously non-present (with domid 0). */
-    rc = iommu_flush_context_device(iommu, 0, PCI_BDF2(bus, devfn),
-                                    DMA_CCMD_MASK_NOBIT, 1);
+    rc = iommu_flush_context_device(iommu, prev_did, PCI_BDF2(bus, devfn),
+                                    DMA_CCMD_MASK_NOBIT, !prev_dom);
     flush_dev_iotlb = !!find_ats_dev_drhd(iommu);
-    ret = iommu_flush_iotlb_dsi(iommu, 0, 1, flush_dev_iotlb);
+    ret = iommu_flush_iotlb_dsi(iommu, prev_did, !prev_dom, flush_dev_iotlb);
 
     /*
      * The current logic for returns:
@@ -1461,17 +1560,26 @@ int domain_context_mapping_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !seg && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, MAP_ME_PHANTOM_FUNC);
+        rc = me_wifi_quirk(domain, bus, devfn, mode);
 
     if ( rc )
     {
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        if ( !prev_dom )
+            ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        else if ( prev_dom != domain ) /* Avoid infinite recursion. */
+            ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
+                                             mode & MAP_WITH_RMRR) < 0;
+        else
+            ret = 1;
 
         if ( !ret && pdev && pdev->devfn == devfn )
             check_cleanup_domid_map(domain, pdev, iommu);
     }
 
-    return rc;
+    if ( prev_dom )
+        rcu_unlock_domain(prev_dom);
+
+    return rc ?: pdev && prev_dom;
 }
 
 static int domain_context_unmap(struct domain *d, uint8_t devfn,
@@ -1481,8 +1589,10 @@ static int domain_context_mapping(struct
                                   struct pci_dev *pdev)
 {
     struct acpi_drhd_unit *drhd;
+    const struct acpi_rmrr_unit *rmrr;
     int ret = 0;
-    uint16_t seg = pdev->seg;
+    unsigned int i, mode = 0;
+    uint16_t seg = pdev->seg, bdf;
     uint8_t bus = pdev->bus, secbus;
 
     drhd = acpi_find_matched_drhd_unit(pdev);
@@ -1502,8 +1612,29 @@ static int domain_context_mapping(struct
 
     ASSERT(pcidevs_locked());
 
+    for_each_rmrr_device( rmrr, bdf, i )
+    {
+        if ( rmrr->segment != pdev->seg || bdf != pdev->sbdf.bdf )
+            continue;
+
+        mode |= MAP_WITH_RMRR;
+        break;
+    }
+
+    if ( domain != pdev->domain )
+    {
+        if ( pdev->domain->is_dying )
+            mode |= MAP_OWNER_DYING;
+        else if ( drhd &&
+                  !any_pdev_behind_iommu(pdev->domain, pdev, drhd->iommu) &&
+                  !pdev->phantom_stride )
+            mode |= MAP_SINGLE_DEVICE;
+    }
+
     switch ( pdev->type )
     {
+        bool prev_present;
+
     case DEV_TYPE_PCI_HOST_BRIDGE:
         if ( iommu_debug )
             printk(VTDPREFIX "d%d:Hostbridge: skip %04x:%02x:%02x.%u map\n",
@@ -1524,7 +1655,9 @@ static int domain_context_mapping(struct
                    domain->domain_id, seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev);
+                                         pdev, mode);
+        if ( ret > 0 )
+            ret = 0;
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             enable_ats_device(pdev, &drhd->iommu->ats_devices);
 
@@ -1537,9 +1670,10 @@ static int domain_context_mapping(struct
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev);
-        if ( ret )
+                                         pdev, mode);
+        if ( ret < 0 )
             break;
+        prev_present = ret;
 
         if ( (ret = find_upstream_bridge(seg, &bus, &devfn, &secbus)) < 1 )
         {
@@ -1547,6 +1681,15 @@ static int domain_context_mapping(struct
                 break;
             ret = -ENXIO;
         }
+        /*
+         * Strictly speaking if the device is the only one behind this bridge
+         * and the only one with this (secbus,0,0) tuple, it could be allowed
+         * to be re-assigned regardless of RMRR presence.  But let's deal with
+         * that case only if it is actually found in the wild.
+         */
+        else if ( prev_present && (mode & MAP_WITH_RMRR) &&
+                  domain != pdev->domain )
+            ret = -EOPNOTSUPP;
 
         /*
          * Mapping a bridge should, if anything, pass the struct pci_dev of
@@ -1555,7 +1698,7 @@ static int domain_context_mapping(struct
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                             NULL);
+                                             NULL, mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1570,10 +1713,15 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL);
+                                             NULL, mode);
 
         if ( ret )
-            domain_context_unmap(domain, devfn, pdev);
+        {
+            if ( !prev_present )
+                domain_context_unmap(domain, devfn, pdev);
+            else if ( pdev->domain != domain ) /* Avoid infinite recursion. */
+                domain_context_mapping(pdev->domain, devfn, pdev);
+        }
 
         break;
 
@@ -2352,9 +2500,8 @@ static int reassign_device_ownership(
 {
     int ret;
 
-    ret = domain_context_unmap(source, devfn, pdev);
-    if ( ret )
-        return ret;
+    if ( !has_arch_pdevs(target) )
+        vmx_pi_hooks_assign(target);
 
     /*
      * Devices assigned to untrusted domains (here assumed to be any domU)
@@ -2364,6 +2511,31 @@ static int reassign_device_ownership(
     if ( (target != hardware_domain) && !iommu_intremap )
         untrusted_msi = true;
 
+    ret = domain_context_mapping(target, devfn, pdev);
+    if ( ret )
+    {
+        if ( !has_arch_pdevs(target) )
+            vmx_pi_hooks_deassign(target);
+        return ret;
+    }
+
+    if ( pdev->devfn == devfn )
+    {
+        const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
+
+        if ( drhd )
+            check_cleanup_domid_map(source, pdev, drhd->iommu);
+    }
+
+    if ( devfn == pdev->devfn && pdev->domain != target )
+    {
+        list_move(&pdev->domain_list, &target->pdev_list);
+        pdev->domain = target;
+    }
+
+    if ( !has_arch_pdevs(source) )
+        vmx_pi_hooks_deassign(source);
+
     /*
      * If the device belongs to the hardware domain, and it has RMRR, don't
      * remove it from the hardware domain, because BIOS may use RMRR at
@@ -2392,34 +2564,7 @@ static int reassign_device_ownership(
             }
     }
 
-    if ( devfn == pdev->devfn && pdev->domain != dom_io )
-    {
-        list_move(&pdev->domain_list, &dom_io->pdev_list);
-        pdev->domain = dom_io;
-    }
-
-    if ( !has_arch_pdevs(source) )
-        vmx_pi_hooks_deassign(source);
-
-    if ( !has_arch_pdevs(target) )
-        vmx_pi_hooks_assign(target);
-
-    ret = domain_context_mapping(target, devfn, pdev);
-    if ( ret )
-    {
-        if ( !has_arch_pdevs(target) )
-            vmx_pi_hooks_deassign(target);
-
-        return ret;
-    }
-
-    if ( devfn == pdev->devfn && pdev->domain != target )
-    {
-        list_move(&pdev->domain_list, &target->pdev_list);
-        pdev->domain = target;
-    }
-
-    return ret;
+    return 0;
 }
 
 static int intel_iommu_assign_device(
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -202,8 +202,12 @@ struct root_entry {
     do {(root).val |= ((value) & PAGE_MASK_4K);} while(0)
 
 struct context_entry {
-    u64 lo;
-    u64 hi;
+    union {
+        struct {
+            uint64_t lo, hi;
+        };
+        __uint128_t full;
+    };
 };
 #define ROOT_ENTRY_NR (PAGE_SIZE_4K/sizeof(struct root_entry))
 #define context_present(c) ((c).lo & 1)
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -344,7 +344,8 @@ void __init platform_quirks_init(void)
  */
 
 static int __must_check map_me_phantom_function(struct domain *domain,
-                                                u32 dev, int map)
+                                                unsigned int dev,
+                                                unsigned int mode)
 {
     struct acpi_drhd_unit *drhd;
     struct pci_dev *pdev;
@@ -355,9 +356,9 @@ static int __must_check map_me_phantom_f
     drhd = acpi_find_matched_drhd_unit(pdev);
 
     /* map or unmap ME phantom function */
-    if ( map )
+    if ( !(mode & UNMAP_ME_PHANTOM_FUNC) )
         rc = domain_context_mapping_one(domain, drhd->iommu, 0,
-                                        PCI_DEVFN(dev, 7), NULL);
+                                        PCI_DEVFN(dev, 7), NULL, mode);
     else
         rc = domain_context_unmap_one(domain, drhd->iommu, 0,
                                       PCI_DEVFN(dev, 7));
@@ -365,7 +366,8 @@ static int __must_check map_me_phantom_f
     return rc;
 }
 
-int me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map)
+int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn,
+                  unsigned int mode)
 {
     u32 id;
     int rc = 0;
@@ -389,7 +391,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x423b8086:
             case 0x423c8086:
             case 0x423d8086:
-                rc = map_me_phantom_function(domain, 3, map);
+                rc = map_me_phantom_function(domain, 3, mode);
                 break;
             default:
                 break;
@@ -415,7 +417,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x42388086:        /* Puma Peak */
             case 0x422b8086:
             case 0x422c8086:
-                rc = map_me_phantom_function(domain, 22, map);
+                rc = map_me_phantom_function(domain, 22, mode);
                 break;
             default:
                 break;
--- a/xen/drivers/passthrough/vtd/vtd.h
+++ b/xen/drivers/passthrough/vtd/vtd.h
@@ -22,8 +22,14 @@
 
 #include <xen/iommu.h>
 
-#define MAP_ME_PHANTOM_FUNC      1
-#define UNMAP_ME_PHANTOM_FUNC    0
+/*
+ * Values for domain_context_mapping_one()'s and me_wifi_quirk()'s "mode"
+ * parameters.
+ */
+#define MAP_WITH_RMRR         (1u << 0)
+#define MAP_OWNER_DYING       (1u << 1)
+#define MAP_SINGLE_DEVICE     (1u << 2)
+#define UNMAP_ME_PHANTOM_FUNC (1u << 3)
 
 /* Allow for both IOAPIC and IOSAPIC. */
 #define IO_xAPIC_route_entry IO_APIC_route_entry
From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: re-assign devices directly

Devices with unity map ranges, due to it being unspecified how/when
these memory ranges may get accessed, must not be left disconnected from
their unity mappings (as long as it's not certain that the device has
been fully quiesced). Hence rather than tearing down the old root page
table pointer and then establishing the new one, re-assignment needs to
be done in a single step.

This is CVE-2022-26360 / part of XSA-400.

Reported-by: Roger Pau Monné <roger.pau@citrix.com>

Similarly, the scratch-page mode of quarantining relies on page tables being
continuously wired up.

To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any unity map
ranges.  The main difference is when it comes to updating DTEs, which need
to be atomic when there are unity mappings. Yet atomicity can only be
achieved with CMPXCHG16B, the availability of which we can't take for granted.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
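
Complementing the CMPXCHG16B sketch given for the corresponding VT-d patch
above, the best-effort fallback used here writes the qword carrying the
domain ID last.  A small, self-contained illustration (the field layout is
made up for the example and is not the real DTE format):

    #include <stdint.h>
    #include <stdio.h>

    struct fake_dte {
        uint64_t lo;   /* pretend: root pointer, permissions, valid bit */
        uint64_t hi;   /* pretend: the domain ID lives in this half */
    };

    static void best_effort_update(struct fake_dte *dte, struct fake_dte new)
    {
        /*
         * For a short window the IOMMU may combine TLB entries tagged with
         * the old domain ID and the new page tables.  That is tolerable
         * only if no other active device shares the old domain ID, hence
         * the "single device" / "owner dying" checks in the real code.
         */
        __atomic_store_n(&dte->lo, new.lo, __ATOMIC_SEQ_CST);
        __atomic_store_n(&dte->hi, new.hi, __ATOMIC_SEQ_CST); /* domain ID last */
    }

    int main(void)
    {
        struct fake_dte dte = { .lo = 0x1, .hi = 0x10 };
        struct fake_dte new = { .lo = 0x2, .hi = 0x20 };

        best_effort_update(&dte, new);
        printf("lo=%#llx hi=%#llx\n",
               (unsigned long long)dte.lo, (unsigned long long)dte.hi);

        return 0;
    }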

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -247,9 +247,13 @@ void amd_iommu_set_intremap_table(struct
                                   const void *ptr,
                                   const struct amd_iommu *iommu,
                                   bool valid);
-void amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
-				   uint64_t root_ptr, uint16_t domain_id,
-				   uint8_t paging_mode, bool valid);
+#define SET_ROOT_VALID          (1u << 0)
+#define SET_ROOT_WITH_UNITY_MAP (1u << 1)
+int __must_check amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
+                                               uint64_t root_ptr,
+                                               uint16_t domain_id,
+                                               uint8_t paging_mode,
+                                               unsigned int flags);
 void iommu_dte_add_device_entry(struct amd_iommu_dte *dte,
                                 const struct ivrs_mappings *ivrs_dev);
 void iommu_dte_set_guest_cr3(struct amd_iommu_dte *dte, uint16_t dom_id,
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -99,10 +99,69 @@ static unsigned int set_iommu_pte_presen
     return flush_flags;
 }
 
-void amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
-                                   uint64_t root_ptr, uint16_t domain_id,
-                                   uint8_t paging_mode, bool valid)
+/*
+ * This function returns
+ * - -errno for errors,
+ * - 0 for a successful update, atomic when necessary
+ * - 1 for a successful but non-atomic update, which may need to be warned
+ *   about by the caller.
+ */
+int amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
+                                  uint64_t root_ptr, uint16_t domain_id,
+                                  uint8_t paging_mode, unsigned int flags)
 {
+    bool valid = flags & SET_ROOT_VALID;
+
+    if ( dte->v && dte->tv &&
+         (cpu_has_cx16 || (flags & SET_ROOT_WITH_UNITY_MAP)) )
+    {
+        union {
+            struct amd_iommu_dte dte;
+            uint64_t raw64[4];
+            __uint128_t raw128[2];
+        } ldte = { .dte = *dte };
+        __uint128_t old = ldte.raw128[0];
+        int ret = 0;
+
+        ldte.dte.domain_id = domain_id;
+        ldte.dte.pt_root = paddr_to_pfn(root_ptr);
+        ldte.dte.iw = true;
+        ldte.dte.ir = true;
+        ldte.dte.paging_mode = paging_mode;
+        ldte.dte.v = valid;
+
+        if ( cpu_has_cx16 )
+        {
+            __uint128_t res = cmpxchg16b(dte, &old, &ldte.raw128[0]);
+
+            /*
+             * Hardware does not update the DTE behind our backs, so the
+             * return value should match "old".
+             */
+            if ( res != old )
+            {
+                printk(XENLOG_ERR
+                       "Dom%d: unexpected DTE %016lx_%016lx (expected %016lx_%016lx)\n",
+                       domain_id,
+                       (uint64_t)(res >> 64), (uint64_t)res,
+                       (uint64_t)(old >> 64), (uint64_t)old);
+                ret = -EILSEQ;
+            }
+        }
+        else /* Best effort, updating domain_id last. */
+        {
+            uint64_t *ptr = (void *)dte;
+
+            write_atomic(ptr + 0, ldte.raw64[0]);
+            /* No barrier should be needed between these two. */
+            write_atomic(ptr + 1, ldte.raw64[1]);
+
+            ret = 1;
+        }
+
+        return ret;
+    }
+
     if ( valid || dte->v )
     {
         dte->tv = false;
@@ -117,6 +176,8 @@ void amd_iommu_set_root_page_table(struc
     smp_wmb();
     dte->tv = true;
     dte->v = valid;
+
+    return 0;
 }
 
 void amd_iommu_set_intremap_table(
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -82,40 +82,81 @@ int get_dma_requestor_id(uint16_t seg, u
     return req_id;
 }
 
-static void amd_iommu_setup_domain_device(
+static int __must_check allocate_domain_resources(struct domain_iommu *hd)
+{
+    int rc;
+
+    spin_lock(&hd->arch.mapping_lock);
+    rc = amd_iommu_alloc_root(hd);
+    spin_unlock(&hd->arch.mapping_lock);
+
+    return rc;
+}
+
+static bool any_pdev_behind_iommu(const struct domain *d,
+                                  const struct pci_dev *exclude,
+                                  const struct amd_iommu *iommu)
+{
+    const struct pci_dev *pdev;
+
+    for_each_pdev ( d, pdev )
+    {
+        if ( pdev == exclude )
+            continue;
+
+        if ( find_iommu_for_device(pdev->seg, pdev->sbdf.bdf) == iommu )
+            return true;
+    }
+
+    return false;
+}
+
+static int __must_check amd_iommu_setup_domain_device(
     struct domain *domain, struct amd_iommu *iommu,
     uint8_t devfn, struct pci_dev *pdev)
 {
     struct amd_iommu_dte *table, *dte;
     unsigned long flags;
-    int req_id, valid = 1;
+    unsigned int req_id, sr_flags;
+    int rc;
     u8 bus = pdev->bus;
-    const struct domain_iommu *hd = dom_iommu(domain);
+    struct domain_iommu *hd = dom_iommu(domain);
+    const struct ivrs_mappings *ivrs_dev;
 
-    BUG_ON( !hd->arch.root_table || !hd->arch.paging_mode ||
-            !iommu->dev_table.buffer );
+    BUG_ON(!hd->arch.paging_mode || !iommu->dev_table.buffer);
 
-    if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
-        valid = 0;
+    rc = allocate_domain_resources(hd);
+    if ( rc )
+        return rc;
+
+    req_id = get_dma_requestor_id(iommu->seg, pdev->sbdf.bdf);
+    ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
+    sr_flags = (iommu_hwdom_passthrough && is_hardware_domain(domain)
+                ? 0 : SET_ROOT_VALID)
+               | (ivrs_dev->unity_map ? SET_ROOT_WITH_UNITY_MAP : 0);
 
     /* get device-table entry */
     req_id = get_dma_requestor_id(iommu->seg, PCI_BDF2(bus, devfn));
     table = iommu->dev_table.buffer;
     dte = &table[req_id];
+    ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
 
     spin_lock_irqsave(&iommu->lock, flags);
 
     if ( !dte->v || !dte->tv )
     {
-        const struct ivrs_mappings *ivrs_dev;
-
         /* bind DTE to domain page-tables */
-        amd_iommu_set_root_page_table(
-            dte, page_to_maddr(hd->arch.root_table), domain->domain_id,
-            hd->arch.paging_mode, valid);
+        rc = amd_iommu_set_root_page_table(
+                 dte, page_to_maddr(hd->arch.root_table),
+                 domain->domain_id, hd->arch.paging_mode, sr_flags);
+        if ( rc )
+        {
+            ASSERT(rc < 0);
+            spin_unlock_irqrestore(&iommu->lock, flags);
+            return rc;
+        }
 
         /* Undo what amd_iommu_disable_domain_device() may have done. */
-        ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
         if ( dte->it_root )
         {
             dte->int_ctl = IOMMU_DEV_TABLE_INT_CONTROL_TRANSLATED;
@@ -130,17 +171,74 @@ static void amd_iommu_setup_domain_devic
             dte->i = ats_enabled;
 
         amd_iommu_flush_device(iommu, req_id);
+    }
+    else if ( dte->pt_root != mfn_x(page_to_mfn(hd->arch.root_table)) )
+    {
+        /*
+         * Strictly speaking if the device is the only one with this requestor
+         * ID, it could be allowed to be re-assigned regardless of unity map
+         * presence.  But let's deal with that case only if it is actually
+         * found in the wild.
+         */
+        if ( req_id != PCI_BDF2(bus, devfn) &&
+             (sr_flags & SET_ROOT_WITH_UNITY_MAP) )
+            rc = -EOPNOTSUPP;
+        else
+            rc = amd_iommu_set_root_page_table(
+                     dte, page_to_maddr(hd->arch.root_table),
+                     domain->domain_id, hd->arch.paging_mode, sr_flags);
+        if ( rc < 0 )
+        {
+            spin_unlock_irqrestore(&iommu->lock, flags);
+            return rc;
+        }
+        if ( rc &&
+             domain != pdev->domain &&
+             /*
+              * By non-atomically updating the DTE's domain ID field last,
+              * during a short window in time TLB entries with the old domain
+              * ID but the new page tables may have been inserted.  This could
+              * affect I/O of other devices using this same (old) domain ID.
+              * Such updating therefore is not a problem if this was the only
+              * device associated with the old domain ID.  Diverting I/O of any
+              * of a dying domain's devices to the quarantine page tables is
+              * intended anyway.
+              */
+             !pdev->domain->is_dying &&
+             (any_pdev_behind_iommu(pdev->domain, pdev, iommu) ||
+              pdev->phantom_stride) )
+            printk(" %04x:%02x:%02x.%u: reassignment may cause %pd data corruption\n",
+                   pdev->seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                   pdev->domain);
+
+        /*
+         * Check remaining settings are still in place from an earlier call
+         * here. They're all independent of the domain, so should not have
+         * changed.
+         */
+        if ( dte->it_root )
+            ASSERT(dte->int_ctl == IOMMU_DEV_TABLE_INT_CONTROL_TRANSLATED);
+        ASSERT(dte->iv == iommu_intremap);
+        ASSERT(dte->ex == ivrs_dev->dte_allow_exclusion);
+        ASSERT(dte->sys_mgt == MASK_EXTR(ivrs_dev->device_flags,
+                                         ACPI_IVHD_SYSTEM_MGMT));
 
-        AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
-                        "root table = %#"PRIx64", "
-                        "domain = %d, paging mode = %d\n",
-                        req_id, pdev->type,
-                        page_to_maddr(hd->arch.root_table),
-                        domain->domain_id, hd->arch.paging_mode);
+        if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
+             iommu_has_cap(iommu, PCI_CAP_IOTLB_SHIFT) )
+            ASSERT(dte->i == ats_enabled);
+
+        amd_iommu_flush_device(iommu, req_id);
     }
 
     spin_unlock_irqrestore(&iommu->lock, flags);
 
+    AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
+                    "root table = %#"PRIx64", "
+                    "domain = %d, paging mode = %d\n",
+                    req_id, pdev->type,
+                    page_to_maddr(hd->arch.root_table),
+                    domain->domain_id, hd->arch.paging_mode);
+
     ASSERT(pcidevs_locked());
 
     if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
@@ -151,6 +249,8 @@ static void amd_iommu_setup_domain_devic
 
         amd_iommu_flush_iotlb(devfn, pdev, INV_IOMMU_ALL_PAGES_ADDRESS, 0);
     }
+
+    return 0;
 }
 
 int __init acpi_ivrs_init(void)
@@ -214,17 +314,6 @@ int amd_iommu_alloc_root(struct domain_i
     return 0;
 }
 
-static int __must_check allocate_domain_resources(struct domain_iommu *hd)
-{
-    int rc;
-
-    spin_lock(&hd->arch.mapping_lock);
-    rc = amd_iommu_alloc_root(hd);
-    spin_unlock(&hd->arch.mapping_lock);
-
-    return rc;
-}
-
 int __read_mostly amd_iommu_min_paging_mode = 1;
 
 static int amd_iommu_domain_init(struct domain *d)
@@ -324,7 +413,6 @@ static int reassign_device(struct domain
 {
     struct amd_iommu *iommu;
     int bdf, rc;
-    struct domain_iommu *t = dom_iommu(target);
     const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
 
     bdf = PCI_BDF2(pdev->bus, pdev->devfn);
@@ -338,7 +426,15 @@ static int reassign_device(struct domain
         return -ENODEV;
     }
 
-    amd_iommu_disable_domain_device(source, iommu, devfn, pdev);
+    rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
+    if ( rc )
+        return rc;
+
+    if ( devfn == pdev->devfn && pdev->domain != target )
+    {
+        list_move(&pdev->domain_list, &target->pdev_list);
+        pdev->domain = target;
+    }
 
     /*
      * If the device belongs to the hardware domain, and it has a unity mapping,
@@ -354,27 +450,10 @@ static int reassign_device(struct domain
             return rc;
     }
 
-    if ( devfn == pdev->devfn && pdev->domain != dom_io )
-    {
-        list_move(&pdev->domain_list, &dom_io->pdev_list);
-        pdev->domain = dom_io;
-    }
-
-    rc = allocate_domain_resources(t);
-    if ( rc )
-        return rc;
-
-    amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
     AMD_IOMMU_DEBUG("Re-assign %04x:%02x:%02x.%u from dom%d to dom%d\n",
                     pdev->seg, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                     source->domain_id, target->domain_id);
 
-    if ( devfn == pdev->devfn && pdev->domain != target )
-    {
-        list_move(&pdev->domain_list, &target->pdev_list);
-        pdev->domain = target;
-    }
-
     return 0;
 }
 
@@ -538,8 +617,7 @@ static int amd_iommu_add_device(u8 devfn
         spin_unlock_irqrestore(&iommu->lock, flags);
     }
 
-    amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
-    return 0;
+    return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
 }
 
 static int amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: prepare for per-device quarantine page tables (part I)

Arrange for domain ID and page table root to be passed around, the latter in
particular to domain_pgd_maddr() such that taking it from the per-domain
fields can be overridden.

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -86,9 +86,10 @@ void *map_vtd_domain_page(u64 maddr);
 void unmap_vtd_domain_page(void *va);
 int domain_context_mapping_one(struct domain *domain, struct vtd_iommu *iommu,
                                uint8_t bus, uint8_t devfn,
-                               const struct pci_dev *pdev, unsigned int mode);
+                               const struct pci_dev *pdev, domid_t domid,
+                               paddr_t pgd_maddr, unsigned int mode);
 int domain_context_unmap_one(struct domain *domain, struct vtd_iommu *iommu,
-                             u8 bus, u8 devfn);
+                             uint8_t bus, uint8_t devfn, domid_t domid);
 int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
 
 unsigned int io_apic_read_remap_rte(unsigned int apic, unsigned int reg);
@@ -107,7 +108,8 @@ void platform_quirks_init(void);
 void vtd_ops_preamble_quirk(struct vtd_iommu *iommu);
 void vtd_ops_postamble_quirk(struct vtd_iommu *iommu);
 int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus,
-                               uint8_t devfn, unsigned int mode);
+                               uint8_t devfn, domid_t domid, paddr_t pgd_maddr,
+                               unsigned int mode);
 void pci_vtd_quirk(const struct pci_dev *);
 void quirk_iommu_caps(struct vtd_iommu *iommu);
 
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1367,12 +1367,12 @@ int domain_context_mapping_one(
     struct domain *domain,
     struct vtd_iommu *iommu,
     uint8_t bus, uint8_t devfn, const struct pci_dev *pdev,
-    unsigned int mode)
+    domid_t domid, paddr_t pgd_maddr, unsigned int mode)
 {
     struct domain_iommu *hd = dom_iommu(domain);
     struct context_entry *context, *context_entries, lctxt;
     __uint128_t old;
-    u64 maddr, pgd_maddr;
+    uint64_t maddr;
     uint16_t seg = iommu->drhd->segment, prev_did = 0;
     struct domain *prev_dom = NULL;
     int agaw, rc, ret;
@@ -1413,10 +1413,12 @@ int domain_context_mapping_one(
     }
     else
     {
+        paddr_t root = pgd_maddr;
+
         spin_lock(&hd->arch.mapping_lock);
 
         /* Ensure we have pagetables allocated down to leaf PTE. */
-        if ( hd->arch.pgd_maddr == 0 )
+        if ( !root )
         {
             addr_to_dma_page_maddr(domain, 0, 1);
             if ( hd->arch.pgd_maddr == 0 )
@@ -1429,22 +1431,24 @@ int domain_context_mapping_one(
                     rcu_unlock_domain(prev_dom);
                 return -ENOMEM;
             }
+
+            root = hd->arch.pgd_maddr;
         }
 
         /* Skip top levels of page tables for 2- and 3-level DRHDs. */
-        pgd_maddr = hd->arch.pgd_maddr;
         for ( agaw = level_to_agaw(4);
               agaw != level_to_agaw(iommu->nr_pt_levels);
               agaw-- )
         {
-            struct dma_pte *p = map_vtd_domain_page(pgd_maddr);
-            pgd_maddr = dma_pte_addr(*p);
+            struct dma_pte *p = map_vtd_domain_page(root);
+
+            root = dma_pte_addr(*p);
             unmap_vtd_domain_page(p);
-            if ( pgd_maddr == 0 )
+            if ( !root )
                 goto nomem;
         }
 
-        context_set_address_root(lctxt, pgd_maddr);
+        context_set_address_root(lctxt, root);
         if ( ats_enabled && ecap_dev_iotlb(iommu->ecap) )
             context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB);
         else
@@ -1560,15 +1564,21 @@ int domain_context_mapping_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !seg && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, mode);
+        rc = me_wifi_quirk(domain, bus, devfn, domid, pgd_maddr, mode);
 
     if ( rc )
     {
         if ( !prev_dom )
-            ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+            ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+                                           domain->domain_id);
         else if ( prev_dom != domain ) /* Avoid infinite recursion. */
+        {
+            hd = dom_iommu(prev_dom);
             ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
+                                             domain->domain_id,
+                                             hd->arch.pgd_maddr,
                                              mode & MAP_WITH_RMRR) < 0;
+        }
         else
             ret = 1;
 
@@ -1590,6 +1600,7 @@ static int domain_context_mapping(struct
 {
     struct acpi_drhd_unit *drhd;
     const struct acpi_rmrr_unit *rmrr;
+    paddr_t pgd_maddr = dom_iommu(domain)->arch.pgd_maddr;
     int ret = 0;
     unsigned int i, mode = 0;
     uint16_t seg = pdev->seg, bdf;
@@ -1655,7 +1666,8 @@ static int domain_context_mapping(struct
                    domain->domain_id, seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, mode);
+                                         pdev, domain->domain_id, pgd_maddr,
+                                         mode);
         if ( ret > 0 )
             ret = 0;
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
@@ -1670,7 +1682,8 @@ static int domain_context_mapping(struct
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, mode);
+                                         pdev, domain->domain_id, pgd_maddr,
+                                         mode);
         if ( ret < 0 )
             break;
         prev_present = ret;
@@ -1698,7 +1711,8 @@ static int domain_context_mapping(struct
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                             NULL, mode);
+                                             NULL, domain->domain_id, pgd_maddr,
+                                             mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1713,7 +1727,8 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL, mode);
+                                             NULL, domain->domain_id, pgd_maddr,
+                                             mode);
 
         if ( ret )
         {
@@ -1742,7 +1757,7 @@ static int domain_context_mapping(struct
 int domain_context_unmap_one(
     struct domain *domain,
     struct vtd_iommu *iommu,
-    u8 bus, u8 devfn)
+    uint8_t bus, uint8_t devfn, domid_t domid)
 {
     struct context_entry *context, *context_entries;
     u64 maddr;
@@ -1800,7 +1815,7 @@ int domain_context_unmap_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !iommu->drhd->segment && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, UNMAP_ME_PHANTOM_FUNC);
+        rc = me_wifi_quirk(domain, bus, devfn, domid, 0, UNMAP_ME_PHANTOM_FUNC);
 
     if ( rc && !is_hardware_domain(domain) && domain != dom_io )
     {
@@ -1853,7 +1868,8 @@ static int domain_context_unmap(struct d
             printk(VTDPREFIX "d%d:PCIe: unmap %04x:%02x:%02x.%u\n",
                    domain->domain_id, seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+                                       domain->domain_id);
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             disable_ats_device(pdev);
 
@@ -1863,7 +1879,8 @@ static int domain_context_unmap(struct d
         if ( iommu_debug )
             printk(VTDPREFIX "d%d:PCI: unmap %04x:%02x:%02x.%u\n",
                    domain->domain_id, seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+                                       domain->domain_id);
         if ( ret )
             break;
 
@@ -1889,12 +1906,15 @@ static int domain_context_unmap(struct d
         /* PCIe to PCI/PCIx bridge */
         if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
         {
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);
+            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                           domain->domain_id);
             if ( !ret )
-                ret = domain_context_unmap_one(domain, iommu, secbus, 0);
+                ret = domain_context_unmap_one(domain, iommu, secbus, 0,
+                                               domain->domain_id);
         }
         else /* Legacy PCI bridge */
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);
+            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                           domain->domain_id);
 
         break;
 
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -345,6 +345,8 @@ void __init platform_quirks_init(void)
 
 static int __must_check map_me_phantom_function(struct domain *domain,
                                                 unsigned int dev,
+                                                domid_t domid,
+                                                paddr_t pgd_maddr,
                                                 unsigned int mode)
 {
     struct acpi_drhd_unit *drhd;
@@ -358,16 +360,17 @@ static int __must_check map_me_phantom_f
     /* map or unmap ME phantom function */
     if ( !(mode & UNMAP_ME_PHANTOM_FUNC) )
         rc = domain_context_mapping_one(domain, drhd->iommu, 0,
-                                        PCI_DEVFN(dev, 7), NULL, mode);
+                                        PCI_DEVFN(dev, 7), NULL,
+                                        domid, pgd_maddr, mode);
     else
         rc = domain_context_unmap_one(domain, drhd->iommu, 0,
-                                      PCI_DEVFN(dev, 7));
+                                      PCI_DEVFN(dev, 7), domid);
 
     return rc;
 }
 
 int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn,
-                  unsigned int mode)
+                  domid_t domid, paddr_t pgd_maddr, unsigned int mode)
 {
     u32 id;
     int rc = 0;
@@ -391,7 +394,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x423b8086:
             case 0x423c8086:
             case 0x423d8086:
-                rc = map_me_phantom_function(domain, 3, mode);
+                rc = map_me_phantom_function(domain, 3, domid, pgd_maddr, mode);
                 break;
             default:
                 break;
@@ -417,7 +420,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x42388086:        /* Puma Peak */
             case 0x422b8086:
             case 0x422c8086:
-                rc = map_me_phantom_function(domain, 22, mode);
+                rc = map_me_phantom_function(domain, 22, domid, pgd_maddr, mode);
                 break;
             default:
                 break;
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: prepare for per-device quarantine page tables (part II)

Replace the passing of struct domain * by domid_t, in preparation for
per-device quarantine page tables also requiring per-device pseudo
domain IDs, which aren't going to be associated with any struct domain
instances.

No functional change intended (except for slightly adjusted log message
text).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -58,8 +58,8 @@ static struct tasklet vtd_fault_tasklet;
 static int setup_hwdom_device(u8 devfn, struct pci_dev *);
 static void setup_hwdom_rmrr(struct domain *d);
 
-static int domain_iommu_domid(struct domain *d,
-                              struct vtd_iommu *iommu)
+static int get_iommu_did(domid_t domid, const struct vtd_iommu *iommu,
+                         bool warn)
 {
     unsigned long nr_dom, i;
 
@@ -67,16 +67,16 @@ static int domain_iommu_domid(struct dom
     i = find_first_bit(iommu->domid_bitmap, nr_dom);
     while ( i < nr_dom )
     {
-        if ( iommu->domid_map[i] == d->domain_id )
+        if ( iommu->domid_map[i] == domid )
             return i;
 
         i = find_next_bit(iommu->domid_bitmap, nr_dom, i+1);
     }
 
-    if ( !d->is_dying )
+    if ( warn )
         dprintk(XENLOG_ERR VTDPREFIX,
-                "Cannot get valid iommu %u domid: %pd\n",
-                iommu->index, d);
+                "No valid iommu %u domid for Dom%d\n",
+                iommu->index, domid);
 
     return -1;
 }
@@ -84,8 +84,7 @@ static int domain_iommu_domid(struct dom
 #define DID_FIELD_WIDTH 16
 #define DID_HIGH_OFFSET 8
 static int context_set_domain_id(struct context_entry *context,
-                                 struct domain *d,
-                                 struct vtd_iommu *iommu)
+                                 domid_t domid, struct vtd_iommu *iommu)
 {
     unsigned long nr_dom, i;
     int found = 0;
@@ -96,7 +95,7 @@ static int context_set_domain_id(struct
     i = find_first_bit(iommu->domid_bitmap, nr_dom);
     while ( i < nr_dom )
     {
-        if ( iommu->domid_map[i] == d->domain_id )
+        if ( iommu->domid_map[i] == domid )
         {
             found = 1;
             break;
@@ -112,7 +111,7 @@ static int context_set_domain_id(struct
             dprintk(XENLOG_ERR VTDPREFIX, "IOMMU: no free domain ids\n");
             return -EFAULT;
         }
-        iommu->domid_map[i] = d->domain_id;
+        iommu->domid_map[i] = domid;
     }
 
     set_bit(i, iommu->domid_bitmap);
@@ -121,9 +120,9 @@ static int context_set_domain_id(struct
     return 0;
 }
 
-static void cleanup_domid_map(struct domain *domain, struct vtd_iommu *iommu)
+static void cleanup_domid_map(domid_t domid, struct vtd_iommu *iommu)
 {
-    int iommu_domid = domain_iommu_domid(domain, iommu);
+    int iommu_domid = get_iommu_did(domid, iommu, false);
 
     if ( iommu_domid >= 0 )
     {
@@ -179,7 +178,7 @@ static void check_cleanup_domid_map(stru
     if ( !found )
     {
         clear_bit(iommu->index, &dom_iommu(d)->arch.iommu_bitmap);
-        cleanup_domid_map(d, iommu);
+        cleanup_domid_map(d->domain_id, iommu);
     }
 }
 
@@ -636,7 +635,7 @@ static int __must_check iommu_flush_iotl
             continue;
 
         flush_dev_iotlb = !!find_ats_dev_drhd(iommu);
-        iommu_domid= domain_iommu_domid(d, iommu);
+        iommu_domid = get_iommu_did(d->domain_id, iommu, !d->is_dying);
         if ( iommu_domid == -1 )
             continue;
 
@@ -1457,7 +1456,7 @@ int domain_context_mapping_one(
         spin_unlock(&hd->arch.mapping_lock);
     }
 
-    rc = context_set_domain_id(&lctxt, domain, iommu);
+    rc = context_set_domain_id(&lctxt, domid, iommu);
     if ( rc )
     {
     unlock:
@@ -1782,7 +1781,7 @@ int domain_context_unmap_one(
     context_clear_entry(*context);
     iommu_sync_cache(context, sizeof(struct context_entry));
 
-    iommu_domid= domain_iommu_domid(domain, iommu);
+    iommu_domid = get_iommu_did(domid, iommu, !domain->is_dying);
     if ( iommu_domid == -1 )
     {
         spin_unlock(&iommu->lock);
@@ -1954,7 +1953,7 @@ static void iommu_domain_teardown(struct
     spin_unlock(&hd->arch.mapping_lock);
 
     for_each_drhd_unit ( drhd )
-        cleanup_domid_map(d, drhd->iommu);
+        cleanup_domid_map(d->domain_id, drhd->iommu);
 }
 
 static int __must_check intel_iommu_map_page(struct domain *d, dfn_t dfn,
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: maintain a per-device pseudo domain ID

In order to subsequently enable per-device quarantine page tables, we'll
need domain-ID-like identifiers to be inserted in the respective device
(AMD) or context (Intel) table entries alongside the per-device page
table root addresses.

Make use of "real" domain IDs occupying only half of the value range
coverable by domid_t.

Note that in VT-d's iommu_alloc() I didn't want to introduce new memory
leaks in case of error, but existing ones don't get plugged - that'll be
the subject of a later change.

The VT-d changes are slightly asymmetric, but this way we can avoid
assigning pseudo domain IDs to devices which would never be mapped while
still avoiding the addition of a new parameter to domain_context_unmap().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
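
As a minimal standalone sketch of the ID-space split described above
(assuming DOMID_MASK is 0x7fff, per Xen's public headers, and using
hypothetical helper names; the authoritative allocator is
iommu_alloc_domid() in the x86/iommu.c hunk below), pseudo IDs are
simply bitmap indices offset into the upper half of the 16-bit range:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DOMID_MASK 0x7fffu

static bool is_pseudo_domid(uint16_t domid)
{
    /* Real domains use 0 ... DOMID_MASK; pseudo IDs are strictly above. */
    return domid > DOMID_MASK;
}

static uint16_t idx_to_pseudo_domid(unsigned int idx)
{
    assert(idx <= DOMID_MASK);                   /* at most 0x8000 pseudo IDs */
    return (uint16_t)(idx | (DOMID_MASK + 1));   /* same encoding as the patch */
}

int main(void)
{
    assert(is_pseudo_domid(idx_to_pseudo_domid(0)));   /* 0x8000 */
    assert(!is_pseudo_domid(0x7ff1));                  /* a reserved real ID */
    return 0;
}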

--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -130,6 +130,10 @@ int pi_update_irte(const struct pi_desc
         iommu_vcall(ops, sync_cache, addr, size);       \
 })
 
+unsigned long *iommu_init_domid(void);
+domid_t iommu_alloc_domid(unsigned long *map);
+void iommu_free_domid(domid_t domid, unsigned long *map);
+
 #endif /* !__ARCH_X86_IOMMU_H__ */
 /*
  * Local variables:
--- a/xen/include/asm-x86/pci.h
+++ b/xen/include/asm-x86/pci.h
@@ -15,6 +15,12 @@
 
 struct arch_pci_dev {
     vmask_t used_vectors;
+    /*
+     * These fields are (de)initialized under pcidevs-lock. Other uses of
+     * them don't race (de)initialization and hence don't strictly need any
+     * locking.
+     */
+    domid_t pseudo_domid;
 };
 
 int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -96,6 +96,7 @@ struct amd_iommu {
     struct ring_buffer cmd_buffer;
     struct ring_buffer event_log;
     struct ring_buffer ppr_log;
+    unsigned long *domid_map;
 
     int exclusion_enable;
     int exclusion_allow_all;
--- a/xen/drivers/passthrough/amd/iommu_detect.c
+++ b/xen/drivers/passthrough/amd/iommu_detect.c
@@ -180,6 +180,11 @@ int __init amd_iommu_detect_one_acpi(
     if ( rt )
         goto out;
 
+    iommu->domid_map = iommu_init_domid();
+    rt = -ENOMEM;
+    if ( !iommu->domid_map )
+        goto out;
+
     rt = pci_ro_device(iommu->seg, bus, PCI_DEVFN(dev, func));
     if ( rt )
         printk(XENLOG_ERR
@@ -191,7 +196,10 @@ int __init amd_iommu_detect_one_acpi(
 
  out:
     if ( rt )
+    {
+        xfree(iommu->domid_map);
         xfree(iommu);
+    }
 
     return rt;
 }
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -554,6 +554,8 @@ static int amd_iommu_add_device(u8 devfn
     struct amd_iommu *iommu;
     u16 bdf;
     struct ivrs_mappings *ivrs_mappings;
+    bool fresh_domid = false;
+    int ret;
 
     if ( !pdev->domain )
         return -EINVAL;
@@ -617,7 +619,22 @@ static int amd_iommu_add_device(u8 devfn
         spin_unlock_irqrestore(&iommu->lock, flags);
     }
 
-    return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
+    if ( iommu_quarantine && pdev->arch.pseudo_domid == DOMID_INVALID )
+    {
+        pdev->arch.pseudo_domid = iommu_alloc_domid(iommu->domid_map);
+        if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+            return -ENOSPC;
+        fresh_domid = true;
+    }
+
+    ret = amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
+    if ( ret && fresh_domid )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
+    return ret;
 }
 
 static int amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
@@ -642,6 +659,9 @@ static int amd_iommu_remove_device(u8 de
 
     amd_iommu_disable_domain_device(pdev->domain, iommu, devfn, pdev);
 
+    iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
+    pdev->arch.pseudo_domid = DOMID_INVALID;
+
     ivrs_mappings = get_ivrs_mappings(pdev->seg);
     bdf = PCI_BDF2(pdev->bus, devfn);
     if ( amd_iommu_perdev_intremap &&
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -339,6 +339,7 @@ static struct pci_dev *alloc_pdev(struct
     *((u8*) &pdev->bus) = bus;
     *((u8*) &pdev->devfn) = devfn;
     pdev->domain = NULL;
+    pdev->arch.pseudo_domid = DOMID_INVALID;
     INIT_LIST_HEAD(&pdev->msi_list);
 
     pos = pci_find_cap_offset(pseg->nr, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
@@ -1349,9 +1350,13 @@ static int _dump_pci_devices(struct pci_
 
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
     {
-        printk("%04x:%02x:%02x.%u - %pd - node %-3d - MSIs < ",
-               pseg->nr, pdev->bus,
-               PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn), pdev->domain,
+        printk("%04x:%02x:%02x.%u - ", pseg->nr, pdev->bus,
+               PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
+        if ( pdev->domain == dom_io )
+            printk("DomIO:%x", pdev->arch.pseudo_domid);
+        else
+            printk("%pd", pdev->domain);
+        printk(" - node %-3d - MSIs < ",
                (pdev->node != NUMA_NO_NODE) ? pdev->node : -1);
         list_for_each_entry ( msi, &pdev->msi_list, list )
                printk("%d ", msi->irq);
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -22,6 +22,7 @@
 #include <xen/sched.h>
 #include <xen/xmalloc.h>
 #include <xen/domain_page.h>
+#include <xen/err.h>
 #include <xen/iocap.h>
 #include <xen/iommu.h>
 #include <xen/numa.h>
@@ -1195,7 +1196,7 @@ int __init iommu_alloc(struct acpi_drhd_
 {
     struct vtd_iommu *iommu;
     unsigned long sagaw, nr_dom;
-    int agaw;
+    int agaw, rc;
 
     if ( nr_iommus >= MAX_IOMMUS )
     {
@@ -1288,7 +1289,16 @@ int __init iommu_alloc(struct acpi_drhd_
     if ( !iommu->domid_map )
         return -ENOMEM;
 
+    iommu->pseudo_domid_map = iommu_init_domid();
+    rc = -ENOMEM;
+    if ( !iommu->pseudo_domid_map )
+        goto free;
+
     return 0;
+
+ free:
+    iommu_free(drhd);
+    return rc;
 }
 
 void __init iommu_free(struct acpi_drhd_unit *drhd)
@@ -1311,6 +1321,7 @@ void __init iommu_free(struct acpi_drhd_
 
     xfree(iommu->domid_bitmap);
     xfree(iommu->domid_map);
+    xfree(iommu->pseudo_domid_map);
 
     if ( iommu->msi.irq >= 0 )
         destroy_irq(iommu->msi.irq);
@@ -1591,8 +1602,8 @@ int domain_context_mapping_one(
     return rc ?: pdev && prev_dom;
 }
 
-static int domain_context_unmap(struct domain *d, uint8_t devfn,
-                                struct pci_dev *pdev);
+static const struct acpi_drhd_unit *domain_context_unmap(
+    struct domain *d, uint8_t devfn, struct pci_dev *pdev);
 
 static int domain_context_mapping(struct domain *domain, u8 devfn,
                                   struct pci_dev *pdev)
@@ -1600,6 +1611,7 @@ static int domain_context_mapping(struct
     struct acpi_drhd_unit *drhd;
     const struct acpi_rmrr_unit *rmrr;
     paddr_t pgd_maddr = dom_iommu(domain)->arch.pgd_maddr;
+    domid_t orig_domid = pdev->arch.pseudo_domid;
     int ret = 0;
     unsigned int i, mode = 0;
     uint16_t seg = pdev->seg, bdf;
@@ -1660,6 +1672,14 @@ static int domain_context_mapping(struct
         break;
 
     case DEV_TYPE_PCIe_ENDPOINT:
+        if ( iommu_quarantine && orig_domid == DOMID_INVALID )
+        {
+            pdev->arch.pseudo_domid =
+                iommu_alloc_domid(drhd->iommu->pseudo_domid_map);
+            if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+                return -ENOSPC;
+        }
+
         if ( iommu_debug )
             printk(VTDPREFIX "d%d:PCIe: map %04x:%02x:%02x.%u\n",
                    domain->domain_id, seg, bus,
@@ -1675,6 +1695,14 @@ static int domain_context_mapping(struct
         break;
 
     case DEV_TYPE_PCI:
+        if ( iommu_quarantine && orig_domid == DOMID_INVALID )
+        {
+            pdev->arch.pseudo_domid =
+                iommu_alloc_domid(drhd->iommu->pseudo_domid_map);
+            if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+                return -ENOSPC;
+        }
+
         if ( iommu_debug )
             printk(VTDPREFIX "d%d:PCI: map %04x:%02x:%02x.%u\n",
                    domain->domain_id, seg, bus,
@@ -1750,6 +1778,13 @@ static int domain_context_mapping(struct
     if ( !ret && devfn == pdev->devfn )
         pci_vtd_quirk(pdev);
 
+    if ( ret && drhd && orig_domid == DOMID_INVALID )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid,
+                         drhd->iommu->pseudo_domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
     return ret;
 }
 
@@ -1832,8 +1867,10 @@ int domain_context_unmap_one(
     return rc;
 }
 
-static int domain_context_unmap(struct domain *domain, u8 devfn,
-                                struct pci_dev *pdev)
+static const struct acpi_drhd_unit *domain_context_unmap(
+    struct domain *domain,
+    uint8_t devfn,
+    struct pci_dev *pdev)
 {
     struct acpi_drhd_unit *drhd;
     struct vtd_iommu *iommu;
@@ -1843,7 +1880,7 @@ static int domain_context_unmap(struct d
 
     drhd = acpi_find_matched_drhd_unit(pdev);
     if ( !drhd )
-        return -ENODEV;
+        return ERR_PTR(-ENODEV);
     iommu = drhd->iommu;
 
     switch ( pdev->type )
@@ -1854,7 +1891,7 @@ static int domain_context_unmap(struct d
                    domain->domain_id, seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
         if ( !is_hardware_domain(domain) )
-            return -EPERM;
+            return ERR_PTR(-EPERM);
         goto out;
 
     case DEV_TYPE_PCIe_BRIDGE:
@@ -1929,7 +1966,7 @@ static int domain_context_unmap(struct d
         check_cleanup_domid_map(domain, pdev, iommu);
 
 out:
-    return ret;
+    return ret ? ERR_PTR(ret) : drhd;
 }
 
 static void iommu_domain_teardown(struct domain *d)
@@ -2153,16 +2190,17 @@ static int intel_iommu_enable_device(str
 
 static int intel_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
 {
+    const struct acpi_drhd_unit *drhd;
     struct acpi_rmrr_unit *rmrr;
     u16 bdf;
-    int ret, i;
+    unsigned int i;
 
     if ( !pdev->domain )
         return -EINVAL;
 
-    ret = domain_context_unmap(pdev->domain, devfn, pdev);
-    if ( ret )
-        return ret;
+    drhd = domain_context_unmap(pdev->domain, devfn, pdev);
+    if ( IS_ERR(drhd) )
+        return PTR_ERR(drhd);
 
     for_each_rmrr_device ( rmrr, bdf, i )
     {
@@ -2179,6 +2217,13 @@ static int intel_iommu_remove_device(u8
                                rmrr->end_address, 0);
     }
 
+    if ( drhd )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid,
+                         drhd->iommu->pseudo_domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
     return 0;
 }
 
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -535,6 +535,7 @@ struct vtd_iommu {
     } flush;
 
     struct list_head ats_devices;
+    unsigned long *pseudo_domid_map; /* "pseudo" domain id bitmap */
     unsigned long *domid_bitmap;  /* domain id bitmap */
     u16 *domid_map;               /* domain id mapping array */
     uint32_t version;
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -375,6 +375,53 @@ void __hwdom_init arch_iommu_hwdom_init(
         return;
 }
 
+unsigned long *__init iommu_init_domid(void)
+{
+    if ( !iommu_quarantine )
+        return ZERO_BLOCK_PTR;
+
+    BUILD_BUG_ON(DOMID_MASK * 2U >= UINT16_MAX);
+
+    return xzalloc_array(unsigned long,
+                         BITS_TO_LONGS(UINT16_MAX - DOMID_MASK));
+}
+
+domid_t iommu_alloc_domid(unsigned long *map)
+{
+    /*
+     * This is used uniformly across all IOMMUs, such that on typical
+     * systems we wouldn't re-use the same ID very quickly (perhaps never).
+     */
+    static unsigned int start;
+    unsigned int idx = find_next_zero_bit(map, UINT16_MAX - DOMID_MASK, start);
+
+    ASSERT(pcidevs_locked());
+
+    if ( idx >= UINT16_MAX - DOMID_MASK )
+        idx = find_first_zero_bit(map, UINT16_MAX - DOMID_MASK);
+    if ( idx >= UINT16_MAX - DOMID_MASK )
+        return DOMID_INVALID;
+
+    __set_bit(idx, map);
+
+    start = idx + 1;
+
+    return idx | (DOMID_MASK + 1);
+}
+
+void iommu_free_domid(domid_t domid, unsigned long *map)
+{
+    ASSERT(pcidevs_locked());
+
+    if ( domid == DOMID_INVALID )
+        return;
+
+    ASSERT(domid > DOMID_MASK);
+
+    if ( !__test_and_clear_bit(domid & DOMID_MASK, map) )
+        BUG();
+}
+
 /*
  * Local variables:
  * mode: C
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: drop TLB flushes from quarantine_init() hooks

The page tables just created aren't hooked up yet anywhere, so there's
nothing that could be present in any TLB, and hence nothing to flush.
Dropping this flush is, at least on the VT-d side, a prereq to per-
device domain ID use when quarantining devices, as dom_io isn't going
to be assigned a DID anymore: The warning in get_iommu_did() would
trigger.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -585,8 +585,6 @@ int __init amd_iommu_quarantine_init(str
  out:
     spin_unlock(&hd->arch.mapping_lock);
 
-    amd_iommu_flush_all_pages(d);
-
     /* Pages leaked in failure case */
     return level ? -ENOMEM : 0;
 }
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2914,7 +2914,6 @@ static int __init intel_iommu_quarantine
     struct dma_pte *parent;
     unsigned int agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
     unsigned int level = agaw_to_level(agaw);
-    int rc;
 
     if ( hd->arch.pgd_maddr )
     {
@@ -2961,10 +2960,8 @@ static int __init intel_iommu_quarantine
  out:
     spin_unlock(&hd->arch.mapping_lock);
 
-    rc = iommu_flush_iotlb_all(d);
-
     /* Pages leaked in failure case */
-    return level ? -ENOMEM : rc;
+    return level ? -ENOMEM : 0;
 }
 
 const struct iommu_ops __initconstrel intel_iommu_ops = {
From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: abstract maximum number of page table levels

We will want to use the constant elsewhere.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -362,7 +362,7 @@ static inline int amd_iommu_get_paging_m
     while ( max_frames > PTE_PER_TABLE_SIZE )
     {
         max_frames = PTE_PER_TABLE_ALIGN(max_frames) >> PTE_PER_TABLE_SHIFT;
-        if ( ++level > 6 )
+        if ( ++level > IOMMU_MAX_PT_LEVELS )
             return -ENOMEM;
     }
 
--- a/xen/drivers/passthrough/amd/iommu-defs.h
+++ b/xen/drivers/passthrough/amd/iommu-defs.h
@@ -106,6 +106,7 @@ struct amd_iommu_dte {
     bool tv:1;
     unsigned int :5;
     unsigned int had:2;
+#define IOMMU_MAX_PT_LEVELS 6
     unsigned int paging_mode:3;
     uint64_t pt_root:40;
     bool ppr:1;
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -256,7 +256,7 @@ static int iommu_pde_from_dfn(struct dom
     table = hd->arch.root_table;
     level = hd->arch.paging_mode;
 
-    BUG_ON( table == NULL || level < 1 || level > 6 );
+    BUG_ON( table == NULL || level < 1 || level > IOMMU_MAX_PT_LEVELS );
 
     /*
      * A frame number past what the current page tables can represent can't
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: use per-device page tables for quarantining

Devices with RMRRs / unity mapped regions, due to it being unspecified
how/when these memory regions may be accessed, may not be left
disconnected from the mappings of these regions (as long as it's not
certain that the device has been fully quiesced). Hence even the page
tables used when quarantining such devices need to have mappings of
those regions. This implies installing page tables in the first place
even when not in scratch-page quarantining mode.

This is CVE-2022-26361 / part of XSA-400.

While for the purpose here it would be sufficient to have devices with
RMRRs / unity mapped regions use per-device page tables, extend this to
all devices (in scratch-page quarantining mode). This allows the leaf
pages to be mapped r/w, thus covering also memory writes (rather than
just reads) issued by non-quiescent devices.

Set up quarantine page tables as late as possible, yet early enough to
not encounter failure during de-assign. This means setup generally
happens in assign_device(), while (for now) the one in deassign_device()
is there mainly to be on the safe side.

In VT-d's DID allocation function don't require the IOMMU lock to be
held anymore: All involved code paths hold pcidevs_lock, so this way we
avoid the need to acquire the IOMMU lock around the new call to
context_set_domain_id().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
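
The resulting page-table shape can be pictured with a small standalone
toy model (4-entry tables and hypothetical names; the real construction
is fill_qpt() in the iommu_map.c hunk below): real mappings such as the
unity-mapped regions are preserved, and every remaining empty slot at
every level is pointed at a shared per-level sink whose leaf stands in
for the writable scratch page, so stray DMA from a non-quiescent device
lands there instead of faulting or reaching unrelated memory:

#include <stdlib.h>

#define ENTRIES 4   /* toy fan-out; real tables hold 512 entries */
#define LEVELS  3   /* toy depth */

struct table { struct table *slot[ENTRIES]; };

static struct table *sink[LEVELS];    /* one shared sink per level */

static struct table *get_sink(unsigned int level)
{
    if ( !sink[level] )
    {
        sink[level] = calloc(1, sizeof(*sink[level]));
        if ( level )    /* interior sink: every slot points at the next sink */
            for ( unsigned int i = 0; i < ENTRIES; ++i )
                sink[level]->slot[i] = get_sink(level - 1);
    }

    return sink[level]; /* level 0 stands in for the writable scratch page */
}

static void fill_gaps(struct table *t, unsigned int level)
{
    for ( unsigned int i = 0; i < ENTRIES; ++i )
    {
        if ( !t->slot[i] )
            t->slot[i] = get_sink(level - 1);    /* empty slot -> shared sink */
        else if ( level > 1 )
            fill_gaps(t->slot[i], level - 1);    /* keep real entries, but
                                                    plug holes beneath them */
    }
}

int main(void)
{
    struct table *root = calloc(1, sizeof(*root));

    root->slot[2] = calloc(1, sizeof(*root));    /* pretend unity mapping */
    fill_gaps(root, LEVELS - 1);                 /* no slot is left empty now */

    return 0;
}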

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1455,7 +1455,7 @@ int set_identity_p2m_entry(struct domain
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
     int ret;
 
-    if ( !paging_mode_translate(p2m->domain) )
+    if ( !paging_mode_translate(d) )
     {
         if ( !is_iommu_enabled(d) )
             return 0;
--- a/xen/include/asm-x86/pci.h
+++ b/xen/include/asm-x86/pci.h
@@ -1,6 +1,8 @@
 #ifndef __X86_PCI_H__
 #define __X86_PCI_H__
 
+#include <xen/mm.h>
+
 #define CF8_BDF(cf8)     (  ((cf8) & 0x00ffff00) >> 8)
 #define CF8_ADDR_LO(cf8) (   (cf8) & 0x000000fc)
 #define CF8_ADDR_HI(cf8) (  ((cf8) & 0x0f000000) >> 16)
@@ -20,7 +22,18 @@ struct arch_pci_dev {
      * them don't race (de)initialization and hence don't strictly need any
      * locking.
      */
+    union {
+        /* Subset of struct arch_iommu's fields, to be used in dom_io. */
+        struct {
+            uint64_t pgd_maddr;
+        } vtd;
+        struct {
+            struct page_info *root_table;
+        } amd;
+    };
     domid_t pseudo_domid;
+    mfn_t leaf_mfn;
+    struct page_list_head pgtables_list;
 };
 
 int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -223,7 +223,8 @@ int amd_iommu_init_late(void);
 int amd_iommu_update_ivrs_mapping_acpi(void);
 int iov_adjust_irq_affinities(void);
 
-int amd_iommu_quarantine_init(struct domain *d);
+int amd_iommu_quarantine_init(struct pci_dev *pdev);
+void amd_iommu_quarantine_teardown(struct pci_dev *pdev);
 
 /* mapping functions */
 int __must_check amd_iommu_map_page(struct domain *d, dfn_t dfn,
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -529,64 +529,137 @@ int amd_iommu_reserve_domain_unity_unmap
     return rc;
 }
 
-int __init amd_iommu_quarantine_init(struct domain *d)
+static int fill_qpt(union amd_iommu_pte *this, unsigned int level,
+                    struct page_info *pgs[IOMMU_MAX_PT_LEVELS],
+                    struct pci_dev *pdev)
 {
-    struct domain_iommu *hd = dom_iommu(d);
+    unsigned int i;
+    int rc = 0;
+
+    for ( i = 0; !rc && i < PTE_PER_TABLE_SIZE; ++i )
+    {
+        union amd_iommu_pte *pte = &this[i], *next;
+
+        if ( !pte->pr )
+        {
+            if ( !pgs[level] )
+            {
+                /*
+                 * The pgtable allocator is fine for the leaf page, as well as
+                 * page table pages, and the resulting allocations are always
+                 * zeroed.
+                 */
+                pgs[level] = alloc_amd_iommu_pgtable();
+                if ( !pgs[level] )
+                {
+                    rc = -ENOMEM;
+                    break;
+                }
+
+                page_list_add(pgs[level], &pdev->arch.pgtables_list);
+
+                if ( level )
+                {
+                    next = __map_domain_page(pgs[level]);
+                    rc = fill_qpt(next, level - 1, pgs, pdev);
+                    unmap_domain_page(next);
+                }
+            }
+
+            /*
+             * PDEs are essentially a subset of PTEs, so this function
+             * is fine to use even at the leaf.
+             */
+            set_iommu_pde_present(pte, mfn_x(page_to_mfn(pgs[level])), level,
+                                  true, true);
+        }
+        else if ( level && pte->next_level )
+        {
+            page_list_add(mfn_to_page(_mfn(pte->mfn)),
+                          &pdev->arch.pgtables_list);
+            next = map_domain_page(_mfn(pte->mfn));
+            rc = fill_qpt(next, level - 1, pgs, pdev);
+            unmap_domain_page(next);
+        }
+    }
+
+    return rc;
+}
+
+int amd_iommu_quarantine_init(struct pci_dev *pdev)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
     unsigned long end_gfn =
         1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT);
     unsigned int level = amd_iommu_get_paging_mode(end_gfn);
-    union amd_iommu_pte *table;
+    unsigned int req_id = get_dma_requestor_id(pdev->seg, pdev->sbdf.bdf);
+    const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
+    int rc;
 
-    if ( hd->arch.root_table )
+    ASSERT(pcidevs_locked());
+    ASSERT(!hd->arch.root_table);
+
+    ASSERT(pdev->arch.pseudo_domid != DOMID_INVALID);
+
+    if ( pdev->arch.amd.root_table )
     {
-        ASSERT_UNREACHABLE();
+        clear_domain_page(pdev->arch.leaf_mfn);
         return 0;
     }
 
-    spin_lock(&hd->arch.mapping_lock);
-
-    hd->arch.root_table = alloc_amd_iommu_pgtable();
-    if ( !hd->arch.root_table )
-        goto out;
-
-    table = __map_domain_page(hd->arch.root_table);
-    while ( level )
+    pdev->arch.amd.root_table = alloc_amd_iommu_pgtable();
+    if ( !pdev->arch.amd.root_table )
+        return -ENOMEM;
+
+    /* Transiently install the root into DomIO, for iommu_identity_mapping(). */
+    hd->arch.root_table = pdev->arch.amd.root_table;
+
+    rc = amd_iommu_reserve_domain_unity_map(dom_io,
+                                            ivrs_mappings[req_id].unity_map,
+                                            0);
+
+    iommu_identity_map_teardown(dom_io);
+    hd->arch.root_table = NULL;
+
+    if ( rc )
+        printk("%04x:%02x:%02x.%u: quarantine unity mapping failed\n",
+               pdev->seg, pdev->bus,
+               PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
+    else
     {
-        struct page_info *pg;
-        unsigned int i;
+        union amd_iommu_pte *root;
+        struct page_info *pgs[IOMMU_MAX_PT_LEVELS] = {};
 
-        /*
-         * The pgtable allocator is fine for the leaf page, as well as
-         * page table pages, and the resulting allocations are always
-         * zeroed.
-         */
-        pg = alloc_amd_iommu_pgtable();
-        if ( !pg )
-            break;
+        spin_lock(&hd->arch.mapping_lock);
 
-        for ( i = 0; i < PTE_PER_TABLE_SIZE; i++ )
-        {
-            union amd_iommu_pte *pde = &table[i];
+        root = __map_domain_page(pdev->arch.amd.root_table);
+        rc = fill_qpt(root, level - 1, pgs, pdev);
+        unmap_domain_page(root);
 
-            /*
-             * PDEs are essentially a subset of PTEs, so this function
-             * is fine to use even at the leaf.
-             */
-            set_iommu_pde_present(pde, mfn_x(page_to_mfn(pg)), level - 1,
-                                  false, true);
-        }
+        pdev->arch.leaf_mfn = page_to_mfn(pgs[0]);
 
-        unmap_domain_page(table);
-        table = __map_domain_page(pg);
-        level--;
+        spin_unlock(&hd->arch.mapping_lock);
     }
-    unmap_domain_page(table);
 
- out:
-    spin_unlock(&hd->arch.mapping_lock);
+    if ( rc )
+        amd_iommu_quarantine_teardown(pdev);
+
+    return rc;
+}
+
+void amd_iommu_quarantine_teardown(struct pci_dev *pdev)
+{
+    struct page_info *pg;
+
+    ASSERT(pcidevs_locked());
+
+    if ( !pdev->arch.amd.root_table )
+        return;
+
+    while ( (pg = page_list_remove_head(&pdev->arch.pgtables_list)) )
+        free_amd_iommu_pgtable(pg);
 
-    /* Pages leaked in failure case */
-    return level ? -ENOMEM : 0;
+    pdev->arch.amd.root_table = NULL;
 }
 
 /*
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -122,6 +122,8 @@ static int __must_check amd_iommu_setup_
     u8 bus = pdev->bus;
     struct domain_iommu *hd = dom_iommu(domain);
     const struct ivrs_mappings *ivrs_dev;
+    const struct page_info *root_pg;
+    domid_t domid;
 
     BUG_ON(!hd->arch.paging_mode || !iommu->dev_table.buffer);
 
@@ -141,14 +143,25 @@ static int __must_check amd_iommu_setup_
     dte = &table[req_id];
     ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
 
+    if ( domain != dom_io )
+    {
+        root_pg = hd->arch.root_table;
+        domid = domain->domain_id;
+    }
+    else
+    {
+        root_pg = pdev->arch.amd.root_table;
+        domid = pdev->arch.pseudo_domid;
+    }
+
     spin_lock_irqsave(&iommu->lock, flags);
 
     if ( !dte->v || !dte->tv )
     {
         /* bind DTE to domain page-tables */
         rc = amd_iommu_set_root_page_table(
-                 dte, page_to_maddr(hd->arch.root_table),
-                 domain->domain_id, hd->arch.paging_mode, sr_flags);
+                 dte, page_to_maddr(root_pg), domid,
+                 hd->arch.paging_mode, sr_flags);
         if ( rc )
         {
             ASSERT(rc < 0);
@@ -172,7 +185,7 @@ static int __must_check amd_iommu_setup_
 
         amd_iommu_flush_device(iommu, req_id);
     }
-    else if ( dte->pt_root != mfn_x(page_to_mfn(hd->arch.root_table)) )
+    else if ( dte->pt_root != mfn_x(page_to_mfn(root_pg)) )
     {
         /*
          * Strictly speaking if the device is the only one with this requestor
@@ -185,8 +198,8 @@ static int __must_check amd_iommu_setup_
             rc = -EOPNOTSUPP;
         else
             rc = amd_iommu_set_root_page_table(
-                     dte, page_to_maddr(hd->arch.root_table),
-                     domain->domain_id, hd->arch.paging_mode, sr_flags);
+                     dte, page_to_maddr(root_pg), domid,
+                     hd->arch.paging_mode, sr_flags);
         if ( rc < 0 )
         {
             spin_unlock_irqrestore(&iommu->lock, flags);
@@ -205,6 +218,7 @@ static int __must_check amd_iommu_setup_
               * intended anyway.
               */
              !pdev->domain->is_dying &&
+             pdev->domain != dom_io &&
              (any_pdev_behind_iommu(pdev->domain, pdev, iommu) ||
               pdev->phantom_stride) )
             printk(" %04x:%02x:%02x.%u: reassignment may cause %pd data corruption\n",
@@ -235,9 +249,8 @@ static int __must_check amd_iommu_setup_
     AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
                     "root table = %#"PRIx64", "
                     "domain = %d, paging mode = %d\n",
-                    req_id, pdev->type,
-                    page_to_maddr(hd->arch.root_table),
-                    domain->domain_id, hd->arch.paging_mode);
+                    req_id, pdev->type, page_to_maddr(root_pg),
+                    domid, hd->arch.paging_mode);
 
     ASSERT(pcidevs_locked());
 
@@ -304,7 +317,7 @@ static int iov_enable_xt(void)
 
 int amd_iommu_alloc_root(struct domain_iommu *hd)
 {
-    if ( unlikely(!hd->arch.root_table) )
+    if ( unlikely(!hd->arch.root_table) && hd != dom_iommu(dom_io) )
     {
         hd->arch.root_table = alloc_amd_iommu_pgtable();
         if ( !hd->arch.root_table )
@@ -395,7 +408,7 @@ static void amd_iommu_disable_domain_dev
 
         AMD_IOMMU_DEBUG("Disable: device id = %#x, "
                         "domain = %d, paging mode = %d\n",
-                        req_id,  domain->domain_id,
+                        req_id, dte->domain_id,
                         dom_iommu(domain)->arch.paging_mode);
     }
     spin_unlock_irqrestore(&iommu->lock, flags);
@@ -659,6 +672,8 @@ static int amd_iommu_remove_device(u8 de
 
     amd_iommu_disable_domain_device(pdev->domain, iommu, devfn, pdev);
 
+    amd_iommu_quarantine_teardown(pdev);
+
     iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
     pdev->arch.pseudo_domid = DOMID_INVALID;
 
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -447,21 +447,21 @@ int iommu_iotlb_flush_all(struct domain
     return rc;
 }
 
-static int __init iommu_quarantine_init(void)
+int iommu_quarantine_dev_init(device_t *dev)
 {
     const struct domain_iommu *hd = dom_iommu(dom_io);
-    int rc;
 
-    dom_io->options |= XEN_DOMCTL_CDF_iommu;
+    if ( !iommu_quarantine || !hd->platform_ops->quarantine_init )
+        return 0;
 
-    rc = iommu_domain_init(dom_io, 0);
-    if ( rc )
-        return rc;
+    return iommu_call(hd->platform_ops, quarantine_init, dev);
+}
 
-    if ( !hd->platform_ops->quarantine_init )
-        return 0;
+static int __init iommu_quarantine_init(void)
+{
+    dom_io->options |= XEN_DOMCTL_CDF_iommu;
 
-    return hd->platform_ops->quarantine_init(dom_io);
+    return iommu_domain_init(dom_io, 0);
 }
 
 int __init iommu_setup(void)
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -929,9 +929,16 @@ static int deassign_device(struct domain
         return -ENODEV;
 
     /* De-assignment from dom_io should de-quarantine the device */
-    target = ((pdev->quarantine || iommu_quarantine) &&
-              pdev->domain != dom_io) ?
-        dom_io : hardware_domain;
+    if ( (pdev->quarantine || iommu_quarantine) && pdev->domain != dom_io )
+    {
+        ret = iommu_quarantine_dev_init(pci_to_dev(pdev));
+        if ( ret )
+           return ret;
+
+        target = dom_io;
+    }
+    else
+        target = hardware_domain;
 
     while ( pdev->phantom_stride )
     {
@@ -1528,6 +1535,13 @@ static int assign_device(struct domain *
         msixtbl_init(d);
     }
 
+    if ( pdev->domain != dom_io )
+    {
+        rc = iommu_quarantine_dev_init(pci_to_dev(pdev));
+        if ( rc )
+            goto done;
+    }
+
     pdev->fault.count = 0;
 
     if ( (rc = hd->platform_ops->assign_device(d, devfn, pci_to_dev(pdev), flag)) )
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -43,6 +43,12 @@
 #include "vtd.h"
 #include "../ats.h"
 
+#define DEVICE_DOMID(d, pdev) ((d) != dom_io ? (d)->domain_id \
+                                             : (pdev)->arch.pseudo_domid)
+#define DEVICE_PGTABLE(d, pdev) ((d) != dom_io \
+                                 ? dom_iommu(d)->arch.pgd_maddr \
+                                 : (pdev)->arch.vtd.pgd_maddr)
+
 /* Possible unfiltered LAPIC/MSI messages from untrusted sources? */
 bool __read_mostly untrusted_msi;
 
@@ -84,13 +90,18 @@ static int get_iommu_did(domid_t domid,
 
 #define DID_FIELD_WIDTH 16
 #define DID_HIGH_OFFSET 8
+
+/*
+ * This function may have "context" passed as NULL, to merely obtain a DID
+ * for "domid".
+ */
 static int context_set_domain_id(struct context_entry *context,
                                  domid_t domid, struct vtd_iommu *iommu)
 {
     unsigned long nr_dom, i;
     int found = 0;
 
-    ASSERT(spin_is_locked(&iommu->lock));
+    ASSERT(pcidevs_locked());
 
     nr_dom = cap_ndoms(iommu->cap);
     i = find_first_bit(iommu->domid_bitmap, nr_dom);
@@ -116,8 +127,13 @@ static int context_set_domain_id(struct
     }
 
     set_bit(i, iommu->domid_bitmap);
-    context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
-    context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
+
+    if ( context )
+    {
+        context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
+        context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
+    }
+
     return 0;
 }
 
@@ -167,8 +183,12 @@ static void check_cleanup_domid_map(stru
                                     const struct pci_dev *exclude,
                                     struct vtd_iommu *iommu)
 {
-    bool found = any_pdev_behind_iommu(d, exclude, iommu);
+    bool found;
+
+    if ( d == dom_io )
+        return;
 
+    found = any_pdev_behind_iommu(d, exclude, iommu);
     /*
      * Hidden devices are associated with DomXEN but usable by the hardware
      * domain. Hence they need considering here as well.
@@ -1403,7 +1423,7 @@ int domain_context_mapping_one(
         domid = iommu->domid_map[prev_did];
         if ( domid < DOMID_FIRST_RESERVED )
             prev_dom = rcu_lock_domain_by_id(domid);
-        else if ( domid == DOMID_IO )
+        else if ( pdev ? domid == pdev->arch.pseudo_domid : domid > DOMID_MASK )
             prev_dom = rcu_lock_domain(dom_io);
         if ( !prev_dom )
         {
@@ -1580,15 +1600,12 @@ int domain_context_mapping_one(
     {
         if ( !prev_dom )
             ret = domain_context_unmap_one(domain, iommu, bus, devfn,
-                                           domain->domain_id);
+                                           DEVICE_DOMID(domain, pdev));
         else if ( prev_dom != domain ) /* Avoid infinite recursion. */
-        {
-            hd = dom_iommu(prev_dom);
             ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
-                                             domain->domain_id,
-                                             hd->arch.pgd_maddr,
+                                             DEVICE_DOMID(prev_dom, pdev),
+                                             DEVICE_PGTABLE(prev_dom, pdev),
                                              mode & MAP_WITH_RMRR) < 0;
-        }
         else
             ret = 1;
 
@@ -1610,7 +1627,7 @@ static int domain_context_mapping(struct
 {
     struct acpi_drhd_unit *drhd;
     const struct acpi_rmrr_unit *rmrr;
-    paddr_t pgd_maddr = dom_iommu(domain)->arch.pgd_maddr;
+    paddr_t pgd_maddr = DEVICE_PGTABLE(domain, pdev);
     domid_t orig_domid = pdev->arch.pseudo_domid;
     int ret = 0;
     unsigned int i, mode = 0;
@@ -1643,7 +1660,7 @@ static int domain_context_mapping(struct
         break;
     }
 
-    if ( domain != pdev->domain )
+    if ( domain != pdev->domain && pdev->domain != dom_io )
     {
         if ( pdev->domain->is_dying )
             mode |= MAP_OWNER_DYING;
@@ -1684,8 +1701,8 @@ static int domain_context_mapping(struct
             printk(VTDPREFIX "d%d:PCIe: map %04x:%02x:%02x.%u\n",
                    domain->domain_id, seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
-        ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, domain->domain_id, pgd_maddr,
+        ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, pdev,
+                                         DEVICE_DOMID(domain, pdev), pgd_maddr,
                                          mode);
         if ( ret > 0 )
             ret = 0;
@@ -1709,8 +1726,8 @@ static int domain_context_mapping(struct
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, domain->domain_id, pgd_maddr,
-                                         mode);
+                                         pdev, DEVICE_DOMID(domain, pdev),
+                                         pgd_maddr, mode);
         if ( ret < 0 )
             break;
         prev_present = ret;
@@ -1738,8 +1755,8 @@ static int domain_context_mapping(struct
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                             NULL, domain->domain_id, pgd_maddr,
-                                             mode);
+                                             NULL, DEVICE_DOMID(domain, pdev),
+                                             pgd_maddr, mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1754,8 +1771,8 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL, domain->domain_id, pgd_maddr,
-                                             mode);
+                                             NULL, DEVICE_DOMID(domain, pdev),
+                                             pgd_maddr, mode);
 
         if ( ret )
         {
@@ -1905,7 +1922,7 @@ static const struct acpi_drhd_unit *doma
                    domain->domain_id, seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
         ret = domain_context_unmap_one(domain, iommu, bus, devfn,
-                                       domain->domain_id);
+                                       DEVICE_DOMID(domain, pdev));
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             disable_ats_device(pdev);
 
@@ -1916,7 +1933,7 @@ static const struct acpi_drhd_unit *doma
             printk(VTDPREFIX "d%d:PCI: unmap %04x:%02x:%02x.%u\n",
                    domain->domain_id, seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
         ret = domain_context_unmap_one(domain, iommu, bus, devfn,
-                                       domain->domain_id);
+                                       DEVICE_DOMID(domain, pdev));
         if ( ret )
             break;
 
@@ -1939,18 +1956,12 @@ static const struct acpi_drhd_unit *doma
             break;
         }
 
+        ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                       DEVICE_DOMID(domain, pdev));
         /* PCIe to PCI/PCIx bridge */
-        if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
-        {
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
-                                           domain->domain_id);
-            if ( !ret )
-                ret = domain_context_unmap_one(domain, iommu, secbus, 0,
-                                               domain->domain_id);
-        }
-        else /* Legacy PCI bridge */
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
-                                           domain->domain_id);
+        if ( !ret && pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
+            ret = domain_context_unmap_one(domain, iommu, secbus, 0,
+                                           DEVICE_DOMID(domain, pdev));
 
         break;
 
@@ -1993,6 +2004,25 @@ static void iommu_domain_teardown(struct
         cleanup_domid_map(d->domain_id, drhd->iommu);
 }
 
+static void quarantine_teardown(struct pci_dev *pdev,
+                                const struct acpi_drhd_unit *drhd)
+{
+    struct page_info *pg;
+
+    ASSERT(pcidevs_locked());
+
+    if ( !pdev->arch.vtd.pgd_maddr )
+        return;
+
+    while ( (pg = page_list_remove_head(&pdev->arch.pgtables_list)) )
+        free_domheap_page(pg);
+
+    pdev->arch.vtd.pgd_maddr = 0;
+
+    if ( drhd )
+        cleanup_domid_map(pdev->arch.pseudo_domid, drhd->iommu);
+}
+
 static int __must_check intel_iommu_map_page(struct domain *d, dfn_t dfn,
                                              mfn_t mfn, unsigned int flags,
                                              unsigned int *flush_flags)
@@ -2217,6 +2247,8 @@ static int intel_iommu_remove_device(u8
                                rmrr->end_address, 0);
     }
 
+    quarantine_teardown(pdev, drhd);
+
     if ( drhd )
     {
         iommu_free_domid(pdev->arch.pseudo_domid,
@@ -2908,60 +2940,139 @@ static void vtd_dump_p2m_table(struct do
     vtd_dump_p2m_table_level(hd->arch.pgd_maddr, agaw_to_level(hd->arch.agaw), 0, 0);
 }
 
-static int __init intel_iommu_quarantine_init(struct domain *d)
+static int fill_qpt(struct dma_pte *this, unsigned int level,
+                    paddr_t maddrs[6], struct pci_dev *pdev)
 {
-    struct domain_iommu *hd = dom_iommu(d);
-    struct dma_pte *parent;
+    struct domain_iommu *hd = dom_iommu(dom_io);
+    unsigned int i;
+    int rc = 0;
+
+    for ( i = 0; !rc && i < PTE_NUM; ++i )
+    {
+        struct dma_pte *pte = &this[i], *next;
+
+        if ( !dma_pte_present(*pte) )
+        {
+            if ( !maddrs[level] )
+            {
+                /*
+                 * The pgtable allocator is fine for the leaf page, as well as
+                 * page table pages, and the resulting allocations are always
+                 * zeroed.
+                 */
+                maddrs[level] = alloc_pgtable_maddr(1, hd->node);
+                if ( !maddrs[level] )
+                {
+                    rc = -ENOMEM;
+                    break;
+                }
+
+                page_list_add(maddr_to_page(maddrs[level]),
+                              &pdev->arch.pgtables_list);
+
+                if ( level )
+                {
+                    next = map_vtd_domain_page(maddrs[level]);
+                    rc = fill_qpt(next, level - 1, maddrs, pdev);
+                    unmap_vtd_domain_page(next);
+                }
+            }
+
+            dma_set_pte_addr(*pte, maddrs[level]);
+            dma_set_pte_readable(*pte);
+            dma_set_pte_writable(*pte);
+        }
+        else if ( level && !dma_pte_superpage(*pte) )
+        {
+            page_list_add(maddr_to_page(dma_pte_addr(*pte)),
+                          &pdev->arch.pgtables_list);
+            next = map_vtd_domain_page(dma_pte_addr(*pte));
+            rc = fill_qpt(next, level - 1, maddrs, pdev);
+            unmap_vtd_domain_page(next);
+        }
+    }
+
+    return rc;
+}
+
+static int intel_iommu_quarantine_init(struct pci_dev *pdev)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
+    paddr_t maddr;
     unsigned int agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
     unsigned int level = agaw_to_level(agaw);
+    const struct acpi_drhd_unit *drhd;
+    const struct acpi_rmrr_unit *rmrr;
+    unsigned int i, bdf;
+    bool rmrr_found = false;
+    int rc;
 
-    if ( hd->arch.pgd_maddr )
+    ASSERT(pcidevs_locked());
+    ASSERT(!hd->arch.pgd_maddr);
+
+    if ( pdev->arch.vtd.pgd_maddr )
     {
-        ASSERT_UNREACHABLE();
+        clear_domain_page(pdev->arch.leaf_mfn);
         return 0;
     }
 
-    spin_lock(&hd->arch.mapping_lock);
+    drhd = acpi_find_matched_drhd_unit(pdev);
+    if ( !drhd )
+        return -ENODEV;
 
-    hd->arch.pgd_maddr = alloc_pgtable_maddr(1, hd->node);
-    if ( !hd->arch.pgd_maddr )
-        goto out;
+    maddr = alloc_pgtable_maddr(1, hd->node);
+    if ( !maddr )
+        return -ENOMEM;
 
-    parent = map_vtd_domain_page(hd->arch.pgd_maddr);
-    while ( level )
-    {
-        uint64_t maddr;
-        unsigned int offset;
+    rc = context_set_domain_id(NULL, pdev->arch.pseudo_domid, drhd->iommu);
 
-        /*
-         * The pgtable allocator is fine for the leaf page, as well as
-         * page table pages, and the resulting allocations are always
-         * zeroed.
-         */
-        maddr = alloc_pgtable_maddr(1, hd->node);
-        if ( !maddr )
+    /* Transiently install the root into DomIO, for iommu_identity_mapping(). */
+    hd->arch.pgd_maddr = maddr;
+
+    for_each_rmrr_device ( rmrr, bdf, i )
+    {
+        if ( rc )
             break;
 
-        for ( offset = 0; offset < PTE_NUM; offset++ )
+        if ( rmrr->segment == pdev->seg && bdf == pdev->sbdf.bdf )
         {
-            struct dma_pte *pte = &parent[offset];
+            rmrr_found = true;
 
-            dma_set_pte_addr(*pte, maddr);
-            dma_set_pte_readable(*pte);
+            rc = iommu_identity_mapping(dom_io, p2m_access_rw,
+                                        rmrr->base_address, rmrr->end_address,
+                                        0);
+            if ( rc )
+                printk(XENLOG_ERR VTDPREFIX
+                       "%04x:%02x:%02x.%u: RMRR quarantine mapping failed\n",
+                       pdev->seg, pdev->bus,
+                       PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn));
         }
-        iommu_sync_cache(parent, PAGE_SIZE);
+    }
 
-        unmap_vtd_domain_page(parent);
-        parent = map_vtd_domain_page(maddr);
-        level--;
+    iommu_identity_map_teardown(dom_io);
+    hd->arch.pgd_maddr = 0;
+    pdev->arch.vtd.pgd_maddr = maddr;
+
+    if ( !rc )
+    {
+        struct dma_pte *root;
+        paddr_t maddrs[6] = {};
+
+        spin_lock(&hd->arch.mapping_lock);
+
+        root = map_vtd_domain_page(maddr);
+        rc = fill_qpt(root, level - 1, maddrs, pdev);
+        unmap_vtd_domain_page(root);
+
+        pdev->arch.leaf_mfn = maddr_to_mfn(maddrs[0]);
+
+        spin_unlock(&hd->arch.mapping_lock);
     }
-    unmap_vtd_domain_page(parent);
 
- out:
-    spin_unlock(&hd->arch.mapping_lock);
+    if ( rc )
+        quarantine_teardown(pdev, drhd);
 
-    /* Pages leaked in failure case */
-    return level ? -ENOMEM : 0;
+    return rc;
 }
 
 const struct iommu_ops __initconstrel intel_iommu_ops = {
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -509,7 +509,7 @@ struct vtd_iommu {
     u32 nr_pt_levels;
     u64	cap;
     u64	ecap;
-    spinlock_t lock; /* protect context, domain ids */
+    spinlock_t lock; /* protect context */
     spinlock_t register_lock; /* protect iommu register handling */
     u64 root_maddr; /* root entry machine address */
     nodeid_t node;
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -236,7 +236,7 @@ typedef int iommu_grdm_t(xen_pfn_t start
 struct iommu_ops {
     int (*init)(struct domain *d);
     void (*hwdom_init)(struct domain *d);
-    int (*quarantine_init)(struct domain *d);
+    int (*quarantine_init)(device_t *dev);
     int (*add_device)(u8 devfn, device_t *dev);
     int (*enable_device)(device_t *dev);
     int (*remove_device)(u8 devfn, device_t *dev);
@@ -356,6 +356,7 @@ int __must_check iommu_suspend(void);
 void iommu_resume(void);
 void iommu_crash_shutdown(void);
 int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
+int iommu_quarantine_dev_init(device_t *dev);
 
 void iommu_share_p2m_table(struct domain *d);
 
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: fix (de)assign ordering when RMRRs are in use

In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully updated.

Also adjust a related log message.

This is CVE-2022-26358 / part of XSA-400.

Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
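
A minimal sketch of the intended ordering, using hypothetical helper names
rather than the functions actually touched by the patch:

    struct domain;    /* opaque for the sketch */
    struct pci_dev;
    int map_reserved_regions(struct domain *, struct pci_dev *);    /* hypothetical */
    int update_context_entry(struct domain *, struct pci_dev *);    /* hypothetical */
    void unmap_reserved_regions(struct domain *, struct pci_dev *); /* hypothetical */

    /* Assign: reserved-region mappings must exist before the context entry
     * points the device at the new domain; they are torn down again only
     * afterwards, i.e. only once the context-entry step has failed. */
    static int assign_with_rmrrs(struct domain *d, struct pci_dev *pdev)
    {
        int rc = map_reserved_regions(d, pdev);    /* 1: RMRRs first        */

        if ( !rc )
            rc = update_context_entry(d, pdev);    /* 2: then context entry */
        if ( rc )
            unmap_reserved_regions(d, pdev);       /* undo only on failure  */
        return rc;
    }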

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2392,6 +2392,10 @@ static int reassign_device_ownership(
 {
     int ret;
 
+    ret = domain_context_unmap(source, devfn, pdev);
+    if ( ret )
+        return ret;
+
     /*
      * Devices assigned to untrusted domains (here assumed to be any domU)
      * can attempt to send arbitrary LAPIC/MSI messages. We are unprotected
@@ -2428,10 +2432,6 @@ static int reassign_device_ownership(
             }
     }
 
-    ret = domain_context_unmap(source, devfn, pdev);
-    if ( ret )
-        return ret;
-
     if ( devfn == pdev->devfn && pdev->domain != dom_io )
     {
         list_move(&pdev->domain_list, &dom_io->pdev_list);
@@ -2507,9 +2507,8 @@ static int intel_iommu_assign_device(
         }
     }
 
-    ret = reassign_device_ownership(s, d, devfn, pdev);
-    if ( ret || d == dom_io )
-        return ret;
+    if ( d == dom_io )
+        return reassign_device_ownership(s, d, devfn, pdev);
 
     /* Setup rmrr identity mapping */
     for_each_rmrr_device( rmrr, bdf, i )
@@ -2522,20 +2521,37 @@ static int intel_iommu_assign_device(
                                          rmrr->end_address, flag);
             if ( ret )
             {
-                int rc;
-
-                rc = reassign_device_ownership(d, s, devfn, pdev);
                 printk(XENLOG_G_ERR VTDPREFIX
-                       " cannot map reserved region (%"PRIx64",%"PRIx64"] for Dom%d (%d)\n",
-                       rmrr->base_address, rmrr->end_address,
-                       d->domain_id, ret);
-                if ( rc )
-                {
-                    printk(XENLOG_ERR VTDPREFIX
-                           " failed to reclaim %pp from %pd (%d)\n",
-                           &PCI_SBDF3(seg, bus, devfn), d, rc);
-                    domain_crash(d);
-                }
+                       "%pd: cannot map reserved region [%"PRIx64",%"PRIx64"]: %d\n",
+                       d, rmrr->base_address, rmrr->end_address, ret);
+                break;
+            }
+        }
+    }
+
+    if ( !ret )
+        ret = reassign_device_ownership(s, d, devfn, pdev);
+
+    /* See reassign_device_ownership() for the hwdom aspect. */
+    if ( !ret || is_hardware_domain(d) )
+        return ret;
+
+    for_each_rmrr_device( rmrr, bdf, i )
+    {
+        if ( rmrr->segment == seg &&
+             PCI_BUS(bdf) == bus &&
+             PCI_DEVFN2(bdf) == devfn )
+        {
+            int rc = iommu_identity_mapping(d, p2m_access_x,
+                                            rmrr->base_address,
+                                            rmrr->end_address, 0);
+
+            if ( rc && rc != -ENOENT )
+            {
+                printk(XENLOG_ERR VTDPREFIX
+                       "%pd: cannot unmap reserved region [%"PRIx64",%"PRIx64"]: %d\n",
+                       d, rmrr->base_address, rmrr->end_address, rc);
+                domain_crash(d);
                 break;
             }
         }
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: fix add/remove ordering when RMRRs are in use

In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully cleared.

Also switch to %pd in related log messages.

Fixes: fa88cfadf918 ("vt-d: Map RMRR in intel_iommu_add_device() if the device has RMRR")
Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
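
Sketched the same way (hypothetical helpers, not the real functions), the
remove path is the mirror image of the add path:

    struct pci_dev;
    int clear_context_entry(struct pci_dev *);       /* hypothetical */
    void unmap_reserved_regions(struct pci_dev *);   /* hypothetical */

    /* Remove: only once the context entry no longer references the device's
     * mappings may the reserved-region mappings be torn down. */
    static int remove_with_rmrrs(struct pci_dev *pdev)
    {
        int rc = clear_context_entry(pdev);        /* 1: context entry first */

        if ( !rc )
            unmap_reserved_regions(pdev);          /* 2: RMRRs only after    */
        return rc;
    }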

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1981,14 +1981,6 @@ static int intel_iommu_add_device(u8 dev
     if ( !pdev->domain )
         return -EINVAL;
 
-    ret = domain_context_mapping(pdev->domain, devfn, pdev);
-    if ( ret )
-    {
-        dprintk(XENLOG_ERR VTDPREFIX, "d%d: context mapping failed\n",
-                pdev->domain->domain_id);
-        return ret;
-    }
-
     for_each_rmrr_device ( rmrr, bdf, i )
     {
         if ( rmrr->segment == pdev->seg &&
@@ -2005,12 +1997,17 @@ static int intel_iommu_add_device(u8 dev
                                          rmrr->base_address, rmrr->end_address,
                                          0);
             if ( ret )
-                dprintk(XENLOG_ERR VTDPREFIX, "d%d: RMRR mapping failed\n",
-                        pdev->domain->domain_id);
+                dprintk(XENLOG_ERR VTDPREFIX, "%pd: RMRR mapping failed\n",
+                        pdev->domain);
         }
     }
 
-    return 0;
+    ret = domain_context_mapping(pdev->domain, devfn, pdev);
+    if ( ret )
+        dprintk(XENLOG_ERR VTDPREFIX, "%pd: context mapping failed\n",
+                pdev->domain);
+
+    return ret;
 }
 
 static int intel_iommu_enable_device(struct pci_dev *pdev)
@@ -2032,11 +2029,15 @@ static int intel_iommu_remove_device(u8
 {
     struct acpi_rmrr_unit *rmrr;
     u16 bdf;
-    int i;
+    int ret, i;
 
     if ( !pdev->domain )
         return -EINVAL;
 
+    ret = domain_context_unmap(pdev->domain, devfn, pdev);
+    if ( ret )
+        return ret;
+
     for_each_rmrr_device ( rmrr, bdf, i )
     {
         if ( rmrr->segment != pdev->seg ||
@@ -2052,7 +2053,7 @@ static int intel_iommu_remove_device(u8
                                rmrr->end_address, 0);
     }
 
-    return domain_context_unmap(pdev->domain, devfn, pdev);
+    return 0;
 }
 
 static int __hwdom_init setup_hwdom_device(u8 devfn, struct pci_dev *pdev)
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: tighten iommu_alloc_pgtable()'s parameter

This is to make more obvious that nothing outside of domain_iommu(d)
actually changes or is otherwise needed by the function.

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -143,7 +143,8 @@ int pi_update_irte(const struct pi_desc
 })
 
 int __must_check iommu_free_pgtables(struct domain *d);
-struct page_info *__must_check iommu_alloc_pgtable(struct domain *d);
+struct domain_iommu;
+struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd);
 
 #endif /* !__ARCH_X86_IOMMU_H__ */
 /*
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -184,7 +184,7 @@ static int iommu_pde_from_dfn(struct dom
     unsigned long  next_table_mfn;
     unsigned int level;
     struct page_info *table;
-    const struct domain_iommu *hd = dom_iommu(d);
+    struct domain_iommu *hd = dom_iommu(d);
 
     table = hd->arch.amd.root_table;
     level = hd->arch.amd.paging_mode;
@@ -220,7 +220,7 @@ static int iommu_pde_from_dfn(struct dom
             mfn = next_table_mfn;
 
             /* allocate lower level page table */
-            table = iommu_alloc_pgtable(d);
+            table = iommu_alloc_pgtable(hd);
             if ( table == NULL )
             {
                 AMD_IOMMU_DEBUG("Cannot allocate I/O page table\n");
@@ -250,7 +250,7 @@ static int iommu_pde_from_dfn(struct dom
 
             if ( next_table_mfn == 0 )
             {
-                table = iommu_alloc_pgtable(d);
+                table = iommu_alloc_pgtable(hd);
                 if ( table == NULL )
                 {
                     AMD_IOMMU_DEBUG("Cannot allocate I/O page table\n");
@@ -483,7 +483,7 @@ int __init amd_iommu_quarantine_init(str
 
     spin_lock(&hd->arch.mapping_lock);
 
-    hd->arch.amd.root_table = iommu_alloc_pgtable(d);
+    hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
     if ( !hd->arch.amd.root_table )
         goto out;
 
@@ -498,7 +498,7 @@ int __init amd_iommu_quarantine_init(str
          * page table pages, and the resulting allocations are always
          * zeroed.
          */
-        pg = iommu_alloc_pgtable(d);
+        pg = iommu_alloc_pgtable(hd);
         if ( !pg )
             break;
 
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -208,7 +208,7 @@ int amd_iommu_alloc_root(struct domain *
 
     if ( unlikely(!hd->arch.amd.root_table) )
     {
-        hd->arch.amd.root_table = iommu_alloc_pgtable(d);
+        hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
         if ( !hd->arch.amd.root_table )
             return -ENOMEM;
     }
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -327,7 +327,7 @@ static u64 addr_to_dma_page_maddr(struct
     {
         struct page_info *pg;
 
-        if ( !alloc || !(pg = iommu_alloc_pgtable(domain)) )
+        if ( !alloc || !(pg = iommu_alloc_pgtable(hd)) )
             goto out;
 
         hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
@@ -347,7 +347,7 @@ static u64 addr_to_dma_page_maddr(struct
             if ( !alloc )
                 break;
 
-            pg = iommu_alloc_pgtable(domain);
+            pg = iommu_alloc_pgtable(hd);
             if ( !pg )
                 break;
 
@@ -2761,7 +2761,7 @@ static int __init intel_iommu_quarantine
         goto out;
     }
 
-    pg = iommu_alloc_pgtable(d);
+    pg = iommu_alloc_pgtable(hd);
 
     rc = -ENOMEM;
     if ( !pg )
@@ -2780,7 +2780,7 @@ static int __init intel_iommu_quarantine
          * page table pages, and the resulting allocations are always
          * zeroed.
          */
-        pg = iommu_alloc_pgtable(d);
+        pg = iommu_alloc_pgtable(hd);
 
         if ( !pg )
             goto out;
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -415,9 +415,8 @@ int iommu_free_pgtables(struct domain *d
     return 0;
 }
 
-struct page_info *iommu_alloc_pgtable(struct domain *d)
+struct page_info *iommu_alloc_pgtable(struct domain_iommu *hd)
 {
-    struct domain_iommu *hd = dom_iommu(d);
     unsigned int memflags = 0;
     struct page_info *pg;
     void *p;
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: drop ownership checking from domain_context_mapping_one()

Despite putting in quite a bit of effort it was not possible to
establish why exactly this code exists (beyond possibly sanity
checking). Instead of a subsequent change further complicating this
logic, simply get rid of it.

Take the opportunity and move the respective unmap_vtd_domain_page() out
of the locked region.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -121,28 +121,6 @@ static int context_set_domain_id(struct
     return 0;
 }
 
-static int context_get_domain_id(struct context_entry *context,
-                                 struct vtd_iommu *iommu)
-{
-    unsigned long dom_index, nr_dom;
-    int domid = -1;
-
-    if (iommu && context)
-    {
-        nr_dom = cap_ndoms(iommu->cap);
-
-        dom_index = context_domain_id(*context);
-
-        if ( dom_index < nr_dom && iommu->domid_map )
-            domid = iommu->domid_map[dom_index];
-        else
-            dprintk(XENLOG_DEBUG VTDPREFIX,
-                    "dom_index %lu exceeds nr_dom %lu or iommu has no domid_map\n",
-                    dom_index, nr_dom);
-    }
-    return domid;
-}
-
 static void cleanup_domid_map(struct domain *domain, struct vtd_iommu *iommu)
 {
     int iommu_domid = domain_iommu_domid(domain, iommu);
@@ -1404,44 +1382,9 @@ int domain_context_mapping_one(
 
     if ( context_present(*context) )
     {
-        int res = 0;
-
-        /* Try to get domain ownership from device structure.  If that's
-         * not available, try to read it from the context itself. */
-        if ( pdev )
-        {
-            if ( pdev->domain != domain )
-            {
-                printk(XENLOG_G_INFO VTDPREFIX "%pd: %pp owned by %pd",
-                       domain, &PCI_SBDF3(seg, bus, devfn),
-                       pdev->domain);
-                res = -EINVAL;
-            }
-        }
-        else
-        {
-            int cdomain;
-            cdomain = context_get_domain_id(context, iommu);
-            
-            if ( cdomain < 0 )
-            {
-                printk(XENLOG_G_WARNING VTDPREFIX
-                       "%pd: %pp mapped, but can't find owner\n",
-                       domain, &PCI_SBDF3(seg, bus, devfn));
-                res = -EINVAL;
-            }
-            else if ( cdomain != domain->domain_id )
-            {
-                printk(XENLOG_G_INFO VTDPREFIX
-                       "%pd: %pp already mapped to d%d",
-                       domain, &PCI_SBDF3(seg, bus, devfn), cdomain);
-                res = -EINVAL;
-            }
-        }
-
-        unmap_vtd_domain_page(context_entries);
         spin_unlock(&iommu->lock);
-        return res;
+        unmap_vtd_domain_page(context_entries);
+        return 0;
     }
 
     if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: re-assign devices directly

Devices with RMRRs, due to it being unspecified how/when the specified
memory regions may get accessed, may not be left disconnected from their
respective mappings (as long as it's not certain that the device has
been fully quiesced). Hence rather than unmapping the old context and
then mapping the new one, re-assignment needs to be done in a single
step.

This is CVE-2022-26359 / part of XSA-400.

Reported-by: Roger Pau Monné <roger.pau@citrix.com>

Similarly quarantining scratch-page mode relies on page tables to be
continuously wired up.

To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any RMRRs. The
main difference is when it comes to updating context entries, which need
to be atomic when there are RMRRs. Yet atomicity can only be achieved
with CMPXCHG16B, the availability of which we can't take for granted.

The seemingly complicated choice of non-negative return values for
domain_context_mapping_one() is to limit code churn: This way callers
passing NULL for pdev don't need fiddling with.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
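
To illustrate the atomicity aspect only (this is not the Xen code: the GCC
builtin stands in for Xen's cmpxchg16b() and the entry layout is reduced to
two bare words), a live context entry can be replaced in one shot like this:

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified stand-in for struct context_entry; must be 16-byte aligned
     * for a 128-bit compare-exchange to be legal. */
    struct ctx_entry {
        uint64_t lo, hi;
    } __attribute__((aligned(16)));

    /* Swap in a fully formed new entry only if the entry still holds the
     * value last read; with -mcx16 on x86-64 this compiles down to (or calls
     * a libatomic wrapper around) LOCK CMPXCHG16B. */
    static bool replace_entry(struct ctx_entry *e, struct ctx_entry snap,
                              struct ctx_entry repl)
    {
        __uint128_t old = ((__uint128_t)snap.hi << 64) | snap.lo;
        __uint128_t want = ((__uint128_t)repl.hi << 64) | repl.lo;

        return __atomic_compare_exchange_n((__uint128_t *)e, &old, want,
                                           false, __ATOMIC_SEQ_CST,
                                           __ATOMIC_SEQ_CST);
    }

Without CMPXCHG16B the patch falls back to two 64-bit writes: when no RMRR is
involved it first clears the present bit, and when one is involved it writes
the DID-carrying half last, as the code comments in the hunks below explain.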

--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -84,7 +84,8 @@ void free_pgtable_maddr(u64 maddr);
 void *map_vtd_domain_page(u64 maddr);
 void unmap_vtd_domain_page(const void *va);
 int domain_context_mapping_one(struct domain *domain, struct vtd_iommu *iommu,
-                               u8 bus, u8 devfn, const struct pci_dev *);
+                               uint8_t bus, uint8_t devfn,
+                               const struct pci_dev *pdev, unsigned int mode);
 int domain_context_unmap_one(struct domain *domain, struct vtd_iommu *iommu,
                              u8 bus, u8 devfn);
 int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
@@ -103,8 +104,8 @@ int is_igd_vt_enabled_quirk(void);
 void platform_quirks_init(void);
 void vtd_ops_preamble_quirk(struct vtd_iommu *iommu);
 void vtd_ops_postamble_quirk(struct vtd_iommu *iommu);
-int __must_check me_wifi_quirk(struct domain *domain,
-                               u8 bus, u8 devfn, int map);
+int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus,
+                               uint8_t devfn, unsigned int mode);
 void pci_vtd_quirk(const struct pci_dev *);
 void quirk_iommu_caps(struct vtd_iommu *iommu);
 
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -117,6 +117,7 @@ static int context_set_domain_id(struct
     }
 
     set_bit(i, iommu->domid_bitmap);
+    context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
     context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
     return 0;
 }
@@ -1362,15 +1363,27 @@ static void __hwdom_init intel_iommu_hwd
     }
 }
 
+/*
+ * This function returns
+ * - a negative errno value upon error,
+ * - zero upon success when previously the entry was non-present, or this isn't
+ *   the "main" request for a device (pdev == NULL), or for no-op quarantining
+ *   assignments,
+ * - positive (one) upon success when previously the entry was present and this
+ *   is the "main" request for a device (pdev != NULL).
+ */
 int domain_context_mapping_one(
     struct domain *domain,
     struct vtd_iommu *iommu,
-    u8 bus, u8 devfn, const struct pci_dev *pdev)
+    uint8_t bus, uint8_t devfn, const struct pci_dev *pdev,
+    unsigned int mode)
 {
     struct domain_iommu *hd = dom_iommu(domain);
-    struct context_entry *context, *context_entries;
+    struct context_entry *context, *context_entries, lctxt;
+    __uint128_t old;
     u64 maddr, pgd_maddr;
-    u16 seg = iommu->drhd->segment;
+    uint16_t seg = iommu->drhd->segment, prev_did = 0;
+    struct domain *prev_dom = NULL;
     int rc, ret;
     bool_t flush_dev_iotlb;
 
@@ -1379,17 +1392,32 @@ int domain_context_mapping_one(
     maddr = bus_to_context_maddr(iommu, bus);
     context_entries = (struct context_entry *)map_vtd_domain_page(maddr);
     context = &context_entries[devfn];
+    old = (lctxt = *context).full;
 
-    if ( context_present(*context) )
+    if ( context_present(lctxt) )
     {
-        spin_unlock(&iommu->lock);
-        unmap_vtd_domain_page(context_entries);
-        return 0;
+        domid_t domid;
+
+        prev_did = context_domain_id(lctxt);
+        domid = iommu->domid_map[prev_did];
+        if ( domid < DOMID_FIRST_RESERVED )
+            prev_dom = rcu_lock_domain_by_id(domid);
+        else if ( domid == DOMID_IO )
+            prev_dom = rcu_lock_domain(dom_io);
+        if ( !prev_dom )
+        {
+            spin_unlock(&iommu->lock);
+            unmap_vtd_domain_page(context_entries);
+            dprintk(XENLOG_DEBUG VTDPREFIX,
+                    "no domain for did %u (nr_dom %u)\n",
+                    prev_did, cap_ndoms(iommu->cap));
+            return -ESRCH;
+        }
     }
 
     if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
     {
-        context_set_translation_type(*context, CONTEXT_TT_PASS_THRU);
+        context_set_translation_type(lctxt, CONTEXT_TT_PASS_THRU);
     }
     else
     {
@@ -1401,36 +1429,107 @@ int domain_context_mapping_one(
             spin_unlock(&hd->arch.mapping_lock);
             spin_unlock(&iommu->lock);
             unmap_vtd_domain_page(context_entries);
+            if ( prev_dom )
+                rcu_unlock_domain(prev_dom);
             return -ENOMEM;
         }
 
-        context_set_address_root(*context, pgd_maddr);
+        context_set_address_root(lctxt, pgd_maddr);
         if ( ats_enabled && ecap_dev_iotlb(iommu->ecap) )
-            context_set_translation_type(*context, CONTEXT_TT_DEV_IOTLB);
+            context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB);
         else
-            context_set_translation_type(*context, CONTEXT_TT_MULTI_LEVEL);
+            context_set_translation_type(lctxt, CONTEXT_TT_MULTI_LEVEL);
 
         spin_unlock(&hd->arch.mapping_lock);
     }
 
-    if ( context_set_domain_id(context, domain, iommu) )
+    if ( context_set_domain_id(&lctxt, domain, iommu) )
     {
+    unlock:
         spin_unlock(&iommu->lock);
         unmap_vtd_domain_page(context_entries);
+        if ( prev_dom )
+            rcu_unlock_domain(prev_dom);
         return -EFAULT;
     }
 
-    context_set_address_width(*context, level_to_agaw(iommu->nr_pt_levels));
-    context_set_fault_enable(*context);
-    context_set_present(*context);
+    if ( !prev_dom )
+    {
+        context_set_address_width(lctxt, level_to_agaw(iommu->nr_pt_levels));
+        context_set_fault_enable(lctxt);
+        context_set_present(lctxt);
+    }
+    else if ( prev_dom == domain )
+    {
+        ASSERT(lctxt.full == context->full);
+        rc = !!pdev;
+        goto unlock;
+    }
+    else
+    {
+        ASSERT(context_address_width(lctxt) ==
+               level_to_agaw(iommu->nr_pt_levels));
+        ASSERT(!context_fault_disable(lctxt));
+    }
+
+    if ( cpu_has_cx16 )
+    {
+        __uint128_t res = cmpxchg16b(context, &old, &lctxt.full);
+
+        /*
+         * Hardware does not update the context entry behind our backs,
+         * so the return value should match "old".
+         */
+        if ( res != old )
+        {
+            if ( pdev )
+                check_cleanup_domid_map(domain, pdev, iommu);
+            printk(XENLOG_ERR
+                   "%pp: unexpected context entry %016lx_%016lx (expected %016lx_%016lx)\n",
+                   &PCI_SBDF3(pdev->seg, pdev->bus, devfn),
+                   (uint64_t)(res >> 64), (uint64_t)res,
+                   (uint64_t)(old >> 64), (uint64_t)old);
+            rc = -EILSEQ;
+            goto unlock;
+        }
+    }
+    else if ( !prev_dom || !(mode & MAP_WITH_RMRR) )
+    {
+        context_clear_present(*context);
+        iommu_sync_cache(context, sizeof(*context));
+
+        write_atomic(&context->hi, lctxt.hi);
+        /* No barrier should be needed between these two. */
+        write_atomic(&context->lo, lctxt.lo);
+    }
+    else /* Best effort, updating DID last. */
+    {
+         /*
+          * By non-atomically updating the context entry's DID field last,
+          * during a short window in time TLB entries with the old domain ID
+          * but the new page tables may be inserted.  This could affect I/O
+          * of other devices using this same (old) domain ID.  Such updating
+          * therefore is not a problem if this was the only device associated
+          * with the old domain ID.  Diverting I/O of any of a dying domain's
+          * devices to the quarantine page tables is intended anyway.
+          */
+        if ( !(mode & (MAP_OWNER_DYING | MAP_SINGLE_DEVICE)) )
+            printk(XENLOG_WARNING VTDPREFIX
+                   " %pp: reassignment may cause %pd data corruption\n",
+                   &PCI_SBDF3(seg, bus, devfn), prev_dom);
+
+        write_atomic(&context->lo, lctxt.lo);
+        /* No barrier should be needed between these two. */
+        write_atomic(&context->hi, lctxt.hi);
+    }
+
     iommu_sync_cache(context, sizeof(struct context_entry));
     spin_unlock(&iommu->lock);
 
-    /* Context entry was previously non-present (with domid 0). */
-    rc = iommu_flush_context_device(iommu, 0, PCI_BDF2(bus, devfn),
-                                    DMA_CCMD_MASK_NOBIT, 1);
+    rc = iommu_flush_context_device(iommu, prev_did, PCI_BDF2(bus, devfn),
+                                    DMA_CCMD_MASK_NOBIT, !prev_dom);
     flush_dev_iotlb = !!find_ats_dev_drhd(iommu);
-    ret = iommu_flush_iotlb_dsi(iommu, 0, 1, flush_dev_iotlb);
+    ret = iommu_flush_iotlb_dsi(iommu, prev_did, !prev_dom, flush_dev_iotlb);
 
     /*
      * The current logic for returns:
@@ -1451,17 +1550,26 @@ int domain_context_mapping_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !seg && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, MAP_ME_PHANTOM_FUNC);
+        rc = me_wifi_quirk(domain, bus, devfn, mode);
 
     if ( rc )
     {
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        if ( !prev_dom )
+            ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        else if ( prev_dom != domain ) /* Avoid infinite recursion. */
+            ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
+                                             mode & MAP_WITH_RMRR) < 0;
+        else
+            ret = 1;
 
         if ( !ret && pdev && pdev->devfn == devfn )
             check_cleanup_domid_map(domain, pdev, iommu);
     }
 
-    return rc;
+    if ( prev_dom )
+        rcu_unlock_domain(prev_dom);
+
+    return rc ?: pdev && prev_dom;
 }
 
 static int domain_context_unmap(struct domain *d, uint8_t devfn,
@@ -1471,8 +1579,10 @@ static int domain_context_mapping(struct
                                   struct pci_dev *pdev)
 {
     struct acpi_drhd_unit *drhd;
+    const struct acpi_rmrr_unit *rmrr;
     int ret = 0;
-    uint16_t seg = pdev->seg;
+    unsigned int i, mode = 0;
+    uint16_t seg = pdev->seg, bdf;
     uint8_t bus = pdev->bus, secbus;
 
     drhd = acpi_find_matched_drhd_unit(pdev);
@@ -1492,8 +1602,29 @@ static int domain_context_mapping(struct
 
     ASSERT(pcidevs_locked());
 
+    for_each_rmrr_device( rmrr, bdf, i )
+    {
+        if ( rmrr->segment != pdev->seg || bdf != pdev->sbdf.bdf )
+            continue;
+
+        mode |= MAP_WITH_RMRR;
+        break;
+    }
+
+    if ( domain != pdev->domain )
+    {
+        if ( pdev->domain->is_dying )
+            mode |= MAP_OWNER_DYING;
+        else if ( drhd &&
+                  !any_pdev_behind_iommu(pdev->domain, pdev, drhd->iommu) &&
+                  !pdev->phantom_stride )
+            mode |= MAP_SINGLE_DEVICE;
+    }
+
     switch ( pdev->type )
     {
+        bool prev_present;
+
     case DEV_TYPE_PCI_HOST_BRIDGE:
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:Hostbridge: skip %pp map\n",
@@ -1512,7 +1643,9 @@ static int domain_context_mapping(struct
             printk(VTDPREFIX "%pd:PCIe: map %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev);
+                                         pdev, mode);
+        if ( ret > 0 )
+            ret = 0;
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             enable_ats_device(pdev, &drhd->iommu->ats_devices);
 
@@ -1524,9 +1657,10 @@ static int domain_context_mapping(struct
                    domain, &PCI_SBDF3(seg, bus, devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev);
-        if ( ret )
+                                         pdev, mode);
+        if ( ret < 0 )
             break;
+        prev_present = ret;
 
         if ( (ret = find_upstream_bridge(seg, &bus, &devfn, &secbus)) < 1 )
         {
@@ -1534,6 +1668,15 @@ static int domain_context_mapping(struct
                 break;
             ret = -ENXIO;
         }
+        /*
+         * Strictly speaking if the device is the only one behind this bridge
+         * and the only one with this (secbus,0,0) tuple, it could be allowed
+         * to be re-assigned regardless of RMRR presence.  But let's deal with
+         * that case only if it is actually found in the wild.
+         */
+        else if ( prev_present && (mode & MAP_WITH_RMRR) &&
+                  domain != pdev->domain )
+            ret = -EOPNOTSUPP;
 
         /*
          * Mapping a bridge should, if anything, pass the struct pci_dev of
@@ -1542,7 +1685,7 @@ static int domain_context_mapping(struct
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                             NULL);
+                                             NULL, mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1557,10 +1700,15 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL);
+                                             NULL, mode);
 
         if ( ret )
-            domain_context_unmap(domain, devfn, pdev);
+        {
+            if ( !prev_present )
+                domain_context_unmap(domain, devfn, pdev);
+            else if ( pdev->domain != domain ) /* Avoid infinite recursion. */
+                domain_context_mapping(pdev->domain, devfn, pdev);
+        }
 
         break;
 
@@ -2336,9 +2484,8 @@ static int reassign_device_ownership(
 {
     int ret;
 
-    ret = domain_context_unmap(source, devfn, pdev);
-    if ( ret )
-        return ret;
+    if ( !has_arch_pdevs(target) )
+        vmx_pi_hooks_assign(target);
 
     /*
      * Devices assigned to untrusted domains (here assumed to be any domU)
@@ -2348,6 +2495,31 @@ static int reassign_device_ownership(
     if ( (target != hardware_domain) && !iommu_intremap )
         untrusted_msi = true;
 
+    ret = domain_context_mapping(target, devfn, pdev);
+    if ( ret )
+    {
+        if ( !has_arch_pdevs(target) )
+            vmx_pi_hooks_deassign(target);
+        return ret;
+    }
+
+    if ( pdev->devfn == devfn )
+    {
+        const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
+
+        if ( drhd )
+            check_cleanup_domid_map(source, pdev, drhd->iommu);
+    }
+
+    if ( devfn == pdev->devfn && pdev->domain != target )
+    {
+        list_move(&pdev->domain_list, &target->pdev_list);
+        pdev->domain = target;
+    }
+
+    if ( !has_arch_pdevs(source) )
+        vmx_pi_hooks_deassign(source);
+
     /*
      * If the device belongs to the hardware domain, and it has RMRR, don't
      * remove it from the hardware domain, because BIOS may use RMRR at
@@ -2376,34 +2548,7 @@ static int reassign_device_ownership(
             }
     }
 
-    if ( devfn == pdev->devfn && pdev->domain != dom_io )
-    {
-        list_move(&pdev->domain_list, &dom_io->pdev_list);
-        pdev->domain = dom_io;
-    }
-
-    if ( !has_arch_pdevs(source) )
-        vmx_pi_hooks_deassign(source);
-
-    if ( !has_arch_pdevs(target) )
-        vmx_pi_hooks_assign(target);
-
-    ret = domain_context_mapping(target, devfn, pdev);
-    if ( ret )
-    {
-        if ( !has_arch_pdevs(target) )
-            vmx_pi_hooks_deassign(target);
-
-        return ret;
-    }
-
-    if ( devfn == pdev->devfn && pdev->domain != target )
-    {
-        list_move(&pdev->domain_list, &target->pdev_list);
-        pdev->domain = target;
-    }
-
-    return ret;
+    return 0;
 }
 
 static int intel_iommu_assign_device(
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -202,8 +202,12 @@ struct root_entry {
     do {(root).val |= ((value) & PAGE_MASK_4K);} while(0)
 
 struct context_entry {
-    u64 lo;
-    u64 hi;
+    union {
+        struct {
+            uint64_t lo, hi;
+        };
+        __uint128_t full;
+    };
 };
 #define ROOT_ENTRY_NR (PAGE_SIZE_4K/sizeof(struct root_entry))
 #define context_present(c) ((c).lo & 1)
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -344,7 +344,8 @@ void __init platform_quirks_init(void)
  */
 
 static int __must_check map_me_phantom_function(struct domain *domain,
-                                                u32 dev, int map)
+                                                unsigned int dev,
+                                                unsigned int mode)
 {
     struct acpi_drhd_unit *drhd;
     struct pci_dev *pdev;
@@ -355,9 +356,9 @@ static int __must_check map_me_phantom_f
     drhd = acpi_find_matched_drhd_unit(pdev);
 
     /* map or unmap ME phantom function */
-    if ( map )
+    if ( !(mode & UNMAP_ME_PHANTOM_FUNC) )
         rc = domain_context_mapping_one(domain, drhd->iommu, 0,
-                                        PCI_DEVFN(dev, 7), NULL);
+                                        PCI_DEVFN(dev, 7), NULL, mode);
     else
         rc = domain_context_unmap_one(domain, drhd->iommu, 0,
                                       PCI_DEVFN(dev, 7));
@@ -365,7 +366,8 @@ static int __must_check map_me_phantom_f
     return rc;
 }
 
-int me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map)
+int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn,
+                  unsigned int mode)
 {
     u32 id;
     int rc = 0;
@@ -389,7 +391,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x423b8086:
             case 0x423c8086:
             case 0x423d8086:
-                rc = map_me_phantom_function(domain, 3, map);
+                rc = map_me_phantom_function(domain, 3, mode);
                 break;
             default:
                 break;
@@ -415,7 +417,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x42388086:        /* Puma Peak */
             case 0x422b8086:
             case 0x422c8086:
-                rc = map_me_phantom_function(domain, 22, map);
+                rc = map_me_phantom_function(domain, 22, mode);
                 break;
             default:
                 break;
--- a/xen/drivers/passthrough/vtd/vtd.h
+++ b/xen/drivers/passthrough/vtd/vtd.h
@@ -22,8 +22,14 @@
 
 #include <xen/iommu.h>
 
-#define MAP_ME_PHANTOM_FUNC      1
-#define UNMAP_ME_PHANTOM_FUNC    0
+/*
+ * Values for domain_context_mapping_one()'s and me_wifi_quirk()'s "mode"
+ * parameters.
+ */
+#define MAP_WITH_RMRR         (1u << 0)
+#define MAP_OWNER_DYING       (1u << 1)
+#define MAP_SINGLE_DEVICE     (1u << 2)
+#define UNMAP_ME_PHANTOM_FUNC (1u << 3)
 
 /* Allow for both IOAPIC and IOSAPIC. */
 #define IO_xAPIC_route_entry IO_APIC_route_entry
From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: re-assign devices directly

Devices with unity map ranges, due to it being unspecified how/when
these memory ranges may get accessed, may not be left disconnected from
their unity mappings (as long as it's not certain that the device has
been fully quiesced). Hence rather than tearing down the old root page
table pointer and then establishing the new one, re-assignment needs to
be done in a single step.

This is CVE-2022-26360 / part of XSA-400.

Reported-by: Roger Pau Monné <roger.pau@citrix.com>

Similarly quarantining scratch-page mode relies on page tables to be
continuously wired up.

To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any unity map
ranges.  The main difference is when it comes to updating DTEs, which need
to be atomic when there are unity mappings. Yet atomicity can only be
achieved with CMPXCHG16B, the availability of which we can't take for granted.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
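
As a sketch of the non-atomic fallback only (illustrative names, not the Xen
structures; the DomainID field is assumed to live in the second quadword of
the 256-bit DTE, alongside the root pointer and mode in the first):

    #include <stdint.h>

    /* Rewrite the root-pointer/mode quadword first and the DomainID-carrying
     * quadword last.  During the short window in between, TLB entries with
     * the old DomainID but the new page tables may be inserted, which is
     * only acceptable when no other active device still uses the old ID. */
    static void update_dte_fallback(volatile uint64_t dte[4],
                                    const uint64_t repl[4])
    {
        __atomic_store_n(&dte[0], repl[0], __ATOMIC_RELEASE); /* root/mode  */
        __atomic_store_n(&dte[1], repl[1], __ATOMIC_RELEASE); /* DomainID   */
    }

This mirrors the write_atomic() pair in the hunk below; the patch takes this
best-effort path only when a unity map forces an in-place update yet
CMPXCHG16B is unavailable, and the caller warns when other devices could be
affected.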

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -247,9 +247,13 @@ void amd_iommu_set_intremap_table(struct
                                   const void *ptr,
                                   const struct amd_iommu *iommu,
                                   bool valid);
-void amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
-				   uint64_t root_ptr, uint16_t domain_id,
-				   uint8_t paging_mode, bool valid);
+#define SET_ROOT_VALID          (1u << 0)
+#define SET_ROOT_WITH_UNITY_MAP (1u << 1)
+int __must_check amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
+                                               uint64_t root_ptr,
+                                               uint16_t domain_id,
+                                               uint8_t paging_mode,
+                                               unsigned int flags);
 void iommu_dte_add_device_entry(struct amd_iommu_dte *dte,
                                 const struct ivrs_mappings *ivrs_dev);
 
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -114,10 +114,69 @@ static unsigned int set_iommu_ptes_prese
     return flush_flags;
 }
 
-void amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
-                                   uint64_t root_ptr, uint16_t domain_id,
-                                   uint8_t paging_mode, bool valid)
+/*
+ * This function returns
+ * - -errno for errors,
+ * - 0 for a successful update, atomic when necessary
+ * - 1 for a successful but non-atomic update, which may need to be warned
+ *   about by the caller.
+ */
+int amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
+                                  uint64_t root_ptr, uint16_t domain_id,
+                                  uint8_t paging_mode, unsigned int flags)
 {
+    bool valid = flags & SET_ROOT_VALID;
+
+    if ( dte->v && dte->tv &&
+         (cpu_has_cx16 || (flags & SET_ROOT_WITH_UNITY_MAP)) )
+    {
+        union {
+            struct amd_iommu_dte dte;
+            uint64_t raw64[4];
+            __uint128_t raw128[2];
+        } ldte = { .dte = *dte };
+        __uint128_t old = ldte.raw128[0];
+        int ret = 0;
+
+        ldte.dte.domain_id = domain_id;
+        ldte.dte.pt_root = paddr_to_pfn(root_ptr);
+        ldte.dte.iw = true;
+        ldte.dte.ir = true;
+        ldte.dte.paging_mode = paging_mode;
+        ldte.dte.v = valid;
+
+        if ( cpu_has_cx16 )
+        {
+            __uint128_t res = cmpxchg16b(dte, &old, &ldte.raw128[0]);
+
+            /*
+             * Hardware does not update the DTE behind our backs, so the
+             * return value should match "old".
+             */
+            if ( res != old )
+            {
+                printk(XENLOG_ERR
+                       "Dom%d: unexpected DTE %016lx_%016lx (expected %016lx_%016lx)\n",
+                       domain_id,
+                       (uint64_t)(res >> 64), (uint64_t)res,
+                       (uint64_t)(old >> 64), (uint64_t)old);
+                ret = -EILSEQ;
+            }
+        }
+        else /* Best effort, updating domain_id last. */
+        {
+            uint64_t *ptr = (void *)dte;
+
+            write_atomic(ptr + 0, ldte.raw64[0]);
+            /* No barrier should be needed between these two. */
+            write_atomic(ptr + 1, ldte.raw64[1]);
+
+            ret = 1;
+        }
+
+        return ret;
+    }
+
     if ( valid || dte->v )
     {
         dte->tv = false;
@@ -132,6 +191,8 @@ void amd_iommu_set_root_page_table(struc
     smp_wmb();
     dte->tv = true;
     dte->v = valid;
+
+    return 0;
 }
 
 void amd_iommu_set_intremap_table(
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -81,41 +81,82 @@ int get_dma_requestor_id(uint16_t seg, u
     return req_id;
 }
 
-static void amd_iommu_setup_domain_device(
+static int __must_check allocate_domain_resources(struct domain *d)
+{
+    struct domain_iommu *hd = dom_iommu(d);
+    int rc;
+
+    spin_lock(&hd->arch.mapping_lock);
+    rc = amd_iommu_alloc_root(d);
+    spin_unlock(&hd->arch.mapping_lock);
+
+    return rc;
+}
+
+static bool any_pdev_behind_iommu(const struct domain *d,
+                                  const struct pci_dev *exclude,
+                                  const struct amd_iommu *iommu)
+{
+    const struct pci_dev *pdev;
+
+    for_each_pdev ( d, pdev )
+    {
+        if ( pdev == exclude )
+            continue;
+
+        if ( find_iommu_for_device(pdev->seg, pdev->sbdf.bdf) == iommu )
+            return true;
+    }
+
+    return false;
+}
+
+static int __must_check amd_iommu_setup_domain_device(
     struct domain *domain, struct amd_iommu *iommu,
     uint8_t devfn, struct pci_dev *pdev)
 {
     struct amd_iommu_dte *table, *dte;
     unsigned long flags;
-    int req_id, valid = 1;
+    unsigned int req_id, sr_flags;
+    int rc;
     u8 bus = pdev->bus;
     const struct domain_iommu *hd = dom_iommu(domain);
+    const struct ivrs_mappings *ivrs_dev;
+
+    BUG_ON(!hd->arch.amd.paging_mode || !iommu->dev_table.buffer);
 
-    BUG_ON( !hd->arch.amd.root_table ||
-            !hd->arch.amd.paging_mode ||
-            !iommu->dev_table.buffer );
+    rc = allocate_domain_resources(domain);
+    if ( rc )
+        return rc;
 
-    if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
-        valid = 0;
+    req_id = get_dma_requestor_id(iommu->seg, pdev->sbdf.bdf);
+    ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
+    sr_flags = (iommu_hwdom_passthrough && is_hardware_domain(domain)
+                ? 0 : SET_ROOT_VALID)
+               | (ivrs_dev->unity_map ? SET_ROOT_WITH_UNITY_MAP : 0);
 
     /* get device-table entry */
     req_id = get_dma_requestor_id(iommu->seg, PCI_BDF2(bus, devfn));
     table = iommu->dev_table.buffer;
     dte = &table[req_id];
+    ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
 
     spin_lock_irqsave(&iommu->lock, flags);
 
     if ( !dte->v || !dte->tv )
     {
-        const struct ivrs_mappings *ivrs_dev;
-
         /* bind DTE to domain page-tables */
-        amd_iommu_set_root_page_table(
-            dte, page_to_maddr(hd->arch.amd.root_table),
-            domain->domain_id, hd->arch.amd.paging_mode, valid);
+        rc = amd_iommu_set_root_page_table(
+                 dte, page_to_maddr(hd->arch.amd.root_table),
+                 domain->domain_id, hd->arch.amd.paging_mode, sr_flags);
+        if ( rc )
+        {
+            ASSERT(rc < 0);
+            spin_unlock_irqrestore(&iommu->lock, flags);
+            return rc;
+        }
 
         /* Undo what amd_iommu_disable_domain_device() may have done. */
-        ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
         if ( dte->it_root )
         {
             dte->int_ctl = IOMMU_DEV_TABLE_INT_CONTROL_TRANSLATED;
@@ -130,17 +171,73 @@ static void amd_iommu_setup_domain_devic
             dte->i = ats_enabled;
 
         amd_iommu_flush_device(iommu, req_id);
+    }
+    else if ( dte->pt_root != mfn_x(page_to_mfn(hd->arch.amd.root_table)) )
+    {
+        /*
+         * Strictly speaking if the device is the only one with this requestor
+         * ID, it could be allowed to be re-assigned regardless of unity map
+         * presence.  But let's deal with that case only if it is actually
+         * found in the wild.
+         */
+        if ( req_id != PCI_BDF2(bus, devfn) &&
+             (sr_flags & SET_ROOT_WITH_UNITY_MAP) )
+            rc = -EOPNOTSUPP;
+        else
+            rc = amd_iommu_set_root_page_table(
+                     dte, page_to_maddr(hd->arch.amd.root_table),
+                     domain->domain_id, hd->arch.amd.paging_mode, sr_flags);
+        if ( rc < 0 )
+        {
+            spin_unlock_irqrestore(&iommu->lock, flags);
+            return rc;
+        }
+        if ( rc &&
+             domain != pdev->domain &&
+             /*
+              * By non-atomically updating the DTE's domain ID field last,
+              * during a short window in time TLB entries with the old domain
+              * ID but the new page tables may have been inserted.  This could
+              * affect I/O of other devices using this same (old) domain ID.
+              * Such updating therefore is not a problem if this was the only
+              * device associated with the old domain ID.  Diverting I/O of any
+              * of a dying domain's devices to the quarantine page tables is
+              * intended anyway.
+              */
+             !pdev->domain->is_dying &&
+             (any_pdev_behind_iommu(pdev->domain, pdev, iommu) ||
+              pdev->phantom_stride) )
+            printk(" %pp: reassignment may cause %pd data corruption\n",
+                   &PCI_SBDF3(pdev->seg, bus, devfn), pdev->domain);
+
+        /*
+         * Check remaining settings are still in place from an earlier call
+         * here. They're all independent of the domain, so should not have
+         * changed.
+         */
+        if ( dte->it_root )
+            ASSERT(dte->int_ctl == IOMMU_DEV_TABLE_INT_CONTROL_TRANSLATED);
+        ASSERT(dte->iv == iommu_intremap);
+        ASSERT(dte->ex == ivrs_dev->dte_allow_exclusion);
+        ASSERT(dte->sys_mgt == MASK_EXTR(ivrs_dev->device_flags,
+                                         ACPI_IVHD_SYSTEM_MGMT));
 
-        AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
-                        "root table = %#"PRIx64", "
-                        "domain = %d, paging mode = %d\n",
-                        req_id, pdev->type,
-                        page_to_maddr(hd->arch.amd.root_table),
-                        domain->domain_id, hd->arch.amd.paging_mode);
+        if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
+             iommu_has_cap(iommu, PCI_CAP_IOTLB_SHIFT) )
+            ASSERT(dte->i == ats_enabled);
+
+        amd_iommu_flush_device(iommu, req_id);
     }
 
     spin_unlock_irqrestore(&iommu->lock, flags);
 
+    AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
+                    "root table = %#"PRIx64", "
+                    "domain = %d, paging mode = %d\n",
+                    req_id, pdev->type,
+                    page_to_maddr(hd->arch.amd.root_table),
+                    domain->domain_id, hd->arch.amd.paging_mode);
+
     ASSERT(pcidevs_locked());
 
     if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
@@ -151,6 +248,8 @@ static void amd_iommu_setup_domain_devic
 
         amd_iommu_flush_iotlb(devfn, pdev, INV_IOMMU_ALL_PAGES_ADDRESS, 0);
     }
+
+    return 0;
 }
 
 int __init acpi_ivrs_init(void)
@@ -216,18 +315,6 @@ int amd_iommu_alloc_root(struct domain *
     return 0;
 }
 
-static int __must_check allocate_domain_resources(struct domain *d)
-{
-    struct domain_iommu *hd = dom_iommu(d);
-    int rc;
-
-    spin_lock(&hd->arch.mapping_lock);
-    rc = amd_iommu_alloc_root(d);
-    spin_unlock(&hd->arch.mapping_lock);
-
-    return rc;
-}
-
 int __read_mostly amd_iommu_min_paging_mode = 1;
 
 static int amd_iommu_domain_init(struct domain *d)
@@ -340,7 +427,15 @@ static int reassign_device(struct domain
         return -ENODEV;
     }
 
-    amd_iommu_disable_domain_device(source, iommu, devfn, pdev);
+    rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
+    if ( rc )
+        return rc;
+
+    if ( devfn == pdev->devfn && pdev->domain != target )
+    {
+        list_move(&pdev->domain_list, &target->pdev_list);
+        pdev->domain = target;
+    }
 
     /*
      * If the device belongs to the hardware domain, and it has a unity mapping,
@@ -356,26 +451,9 @@ static int reassign_device(struct domain
             return rc;
     }
 
-    if ( devfn == pdev->devfn && pdev->domain != dom_io )
-    {
-        list_move(&pdev->domain_list, &dom_io->pdev_list);
-        pdev->domain = dom_io;
-    }
-
-    rc = allocate_domain_resources(target);
-    if ( rc )
-        return rc;
-
-    amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
     AMD_IOMMU_DEBUG("Re-assign %pp from dom%d to dom%d\n",
                     &pdev->sbdf, source->domain_id, target->domain_id);
 
-    if ( devfn == pdev->devfn && pdev->domain != target )
-    {
-        list_move(&pdev->domain_list, &target->pdev_list);
-        pdev->domain = target;
-    }
-
     return 0;
 }
 
@@ -490,8 +568,7 @@ static int amd_iommu_add_device(u8 devfn
         spin_unlock_irqrestore(&iommu->lock, flags);
     }
 
-    amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
-    return 0;
+    return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
 }
 
 static int amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: prepare for per-device quarantine page tables (part I)

Arrange for domain ID and page table root to be passed around, the latter in
particular to domain_pgd_maddr() such that taking it from the per-domain
fields can be overridden.

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
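
Illustrative only: a minimal stand-alone C sketch of the precedence the
reworked domain_pgd_maddr() establishes in the hunk below, where a root
passed in by the caller (later, a per-device quarantine table) wins over
the shared HAP tables and over the per-domain IOMMU tables. The types,
values and helper names here are simplified stand-ins, not Xen's real
interfaces.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t paddr_t;

    /* Hypothetical stand-ins for the real lookups. */
    static bool use_hap_pt;                    /* iommu_use_hap_pt(d)      */
    static paddr_t hap_root = 0x200000;        /* shared EPT/NPT root      */
    static paddr_t per_domain_root = 0x300000; /* hd->arch.vtd.pgd_maddr   */

    /* An explicitly passed root takes precedence over per-domain state. */
    static paddr_t pick_pgd_maddr(paddr_t pgd_maddr)
    {
        if ( pgd_maddr )
            return pgd_maddr;
        return use_hap_pt ? hap_root : per_domain_root;
    }

    int main(void)
    {
        return pick_pgd_maddr(0) == per_domain_root ? 0 : 1;
    }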

--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -85,9 +85,10 @@ void *map_vtd_domain_page(u64 maddr);
 void unmap_vtd_domain_page(const void *va);
 int domain_context_mapping_one(struct domain *domain, struct vtd_iommu *iommu,
                                uint8_t bus, uint8_t devfn,
-                               const struct pci_dev *pdev, unsigned int mode);
+                               const struct pci_dev *pdev, domid_t domid,
+                               paddr_t pgd_maddr, unsigned int mode);
 int domain_context_unmap_one(struct domain *domain, struct vtd_iommu *iommu,
-                             u8 bus, u8 devfn);
+                             uint8_t bus, uint8_t devfn, domid_t domid);
 int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
 
 unsigned int io_apic_read_remap_rte(unsigned int apic, unsigned int reg);
@@ -105,7 +106,8 @@ void platform_quirks_init(void);
 void vtd_ops_preamble_quirk(struct vtd_iommu *iommu);
 void vtd_ops_postamble_quirk(struct vtd_iommu *iommu);
 int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus,
-                               uint8_t devfn, unsigned int mode);
+                               uint8_t devfn, domid_t domid, paddr_t pgd_maddr,
+                               unsigned int mode);
 void pci_vtd_quirk(const struct pci_dev *);
 void quirk_iommu_caps(struct vtd_iommu *iommu);
 
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -355,15 +355,17 @@ static u64 addr_to_dma_page_maddr(struct
     return pte_maddr;
 }
 
-static uint64_t domain_pgd_maddr(struct domain *d, unsigned int nr_pt_levels)
+static paddr_t domain_pgd_maddr(struct domain *d, paddr_t pgd_maddr,
+                                unsigned int nr_pt_levels)
 {
     struct domain_iommu *hd = dom_iommu(d);
-    uint64_t pgd_maddr;
     unsigned int agaw;
 
     ASSERT(spin_is_locked(&hd->arch.mapping_lock));
 
-    if ( iommu_use_hap_pt(d) )
+    if ( pgd_maddr )
+        /* nothing */;
+    else if ( iommu_use_hap_pt(d) )
     {
         pagetable_t pgt = p2m_get_pagetable(p2m_get_hostp2m(d));
 
@@ -1376,12 +1378,12 @@ int domain_context_mapping_one(
     struct domain *domain,
     struct vtd_iommu *iommu,
     uint8_t bus, uint8_t devfn, const struct pci_dev *pdev,
-    unsigned int mode)
+    domid_t domid, paddr_t pgd_maddr, unsigned int mode)
 {
     struct domain_iommu *hd = dom_iommu(domain);
     struct context_entry *context, *context_entries, lctxt;
     __uint128_t old;
-    u64 maddr, pgd_maddr;
+    uint64_t maddr;
     uint16_t seg = iommu->drhd->segment, prev_did = 0;
     struct domain *prev_dom = NULL;
     int rc, ret;
@@ -1421,10 +1423,12 @@ int domain_context_mapping_one(
     }
     else
     {
+        paddr_t root;
+
         spin_lock(&hd->arch.mapping_lock);
 
-        pgd_maddr = domain_pgd_maddr(domain, iommu->nr_pt_levels);
-        if ( !pgd_maddr )
+        root = domain_pgd_maddr(domain, pgd_maddr, iommu->nr_pt_levels);
+        if ( !root )
         {
             spin_unlock(&hd->arch.mapping_lock);
             spin_unlock(&iommu->lock);
@@ -1434,7 +1438,7 @@ int domain_context_mapping_one(
             return -ENOMEM;
         }
 
-        context_set_address_root(lctxt, pgd_maddr);
+        context_set_address_root(lctxt, root);
         if ( ats_enabled && ecap_dev_iotlb(iommu->ecap) )
             context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB);
         else
@@ -1550,15 +1554,21 @@ int domain_context_mapping_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !seg && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, mode);
+        rc = me_wifi_quirk(domain, bus, devfn, domid, pgd_maddr, mode);
 
     if ( rc )
     {
         if ( !prev_dom )
-            ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+            ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+                                           domain->domain_id);
         else if ( prev_dom != domain ) /* Avoid infinite recursion. */
+        {
+            hd = dom_iommu(prev_dom);
             ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
+                                             domain->domain_id,
+                                             hd->arch.vtd.pgd_maddr,
                                              mode & MAP_WITH_RMRR) < 0;
+        }
         else
             ret = 1;
 
@@ -1580,6 +1590,7 @@ static int domain_context_mapping(struct
 {
     struct acpi_drhd_unit *drhd;
     const struct acpi_rmrr_unit *rmrr;
+    paddr_t pgd_maddr = dom_iommu(domain)->arch.vtd.pgd_maddr;
     int ret = 0;
     unsigned int i, mode = 0;
     uint16_t seg = pdev->seg, bdf;
@@ -1643,7 +1654,8 @@ static int domain_context_mapping(struct
             printk(VTDPREFIX "%pd:PCIe: map %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, mode);
+                                         pdev, domain->domain_id, pgd_maddr,
+                                         mode);
         if ( ret > 0 )
             ret = 0;
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
@@ -1657,7 +1669,8 @@ static int domain_context_mapping(struct
                    domain, &PCI_SBDF3(seg, bus, devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, mode);
+                                         pdev, domain->domain_id, pgd_maddr,
+                                         mode);
         if ( ret < 0 )
             break;
         prev_present = ret;
@@ -1685,7 +1698,8 @@ static int domain_context_mapping(struct
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                             NULL, mode);
+                                             NULL, domain->domain_id, pgd_maddr,
+                                             mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1700,7 +1714,8 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL, mode);
+                                             NULL, domain->domain_id, pgd_maddr,
+                                             mode);
 
         if ( ret )
         {
@@ -1728,7 +1743,7 @@ static int domain_context_mapping(struct
 int domain_context_unmap_one(
     struct domain *domain,
     struct vtd_iommu *iommu,
-    u8 bus, u8 devfn)
+    uint8_t bus, uint8_t devfn, domid_t domid)
 {
     struct context_entry *context, *context_entries;
     u64 maddr;
@@ -1786,7 +1801,7 @@ int domain_context_unmap_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !iommu->drhd->segment && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, UNMAP_ME_PHANTOM_FUNC);
+        rc = me_wifi_quirk(domain, bus, devfn, domid, 0, UNMAP_ME_PHANTOM_FUNC);
 
     if ( rc && !is_hardware_domain(domain) && domain != dom_io )
     {
@@ -1837,7 +1852,8 @@ static int domain_context_unmap(struct d
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCIe: unmap %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+                                       domain->domain_id);
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             disable_ats_device(pdev);
 
@@ -1847,7 +1863,8 @@ static int domain_context_unmap(struct d
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCI: unmap %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+                                       domain->domain_id);
         if ( ret )
             break;
 
@@ -1873,12 +1890,15 @@ static int domain_context_unmap(struct d
         /* PCIe to PCI/PCIx bridge */
         if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
         {
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);
+            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                           domain->domain_id);
             if ( !ret )
-                ret = domain_context_unmap_one(domain, iommu, secbus, 0);
+                ret = domain_context_unmap_one(domain, iommu, secbus, 0,
+                                               domain->domain_id);
         }
         else /* Legacy PCI bridge */
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);
+            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                           domain->domain_id);
 
         break;
 
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -345,6 +345,8 @@ void __init platform_quirks_init(void)
 
 static int __must_check map_me_phantom_function(struct domain *domain,
                                                 unsigned int dev,
+                                                domid_t domid,
+                                                paddr_t pgd_maddr,
                                                 unsigned int mode)
 {
     struct acpi_drhd_unit *drhd;
@@ -358,16 +360,17 @@ static int __must_check map_me_phantom_f
     /* map or unmap ME phantom function */
     if ( !(mode & UNMAP_ME_PHANTOM_FUNC) )
         rc = domain_context_mapping_one(domain, drhd->iommu, 0,
-                                        PCI_DEVFN(dev, 7), NULL, mode);
+                                        PCI_DEVFN(dev, 7), NULL,
+                                        domid, pgd_maddr, mode);
     else
         rc = domain_context_unmap_one(domain, drhd->iommu, 0,
-                                      PCI_DEVFN(dev, 7));
+                                      PCI_DEVFN(dev, 7), domid);
 
     return rc;
 }
 
 int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn,
-                  unsigned int mode)
+                  domid_t domid, paddr_t pgd_maddr, unsigned int mode)
 {
     u32 id;
     int rc = 0;
@@ -391,7 +394,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x423b8086:
             case 0x423c8086:
             case 0x423d8086:
-                rc = map_me_phantom_function(domain, 3, mode);
+                rc = map_me_phantom_function(domain, 3, domid, pgd_maddr, mode);
                 break;
             default:
                 break;
@@ -417,7 +420,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x42388086:        /* Puma Peak */
             case 0x422b8086:
             case 0x422c8086:
-                rc = map_me_phantom_function(domain, 22, mode);
+                rc = map_me_phantom_function(domain, 22, domid, pgd_maddr, mode);
                 break;
             default:
                 break;
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: prepare for per-device quarantine page tables (part II)

Replace the passing of struct domain * by domid_t in preparation of
per-device quarantine page tables also requiring per-device pseudo
domain IDs, which aren't going to be associated with any struct domain
instances.

No functional change intended (except for slightly adjusted log message
text).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
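
A minimal illustration (simplified types, hypothetical table size) of why
a bare domid_t suffices for the DID lookup: the mapping is keyed purely on
the numeric ID, so the pseudo IDs introduced by the next patch, which have
no struct domain behind them, can be looked up exactly the same way.

    #include <stdint.h>

    typedef uint16_t domid_t;
    #define NR_DIDS 16                   /* hypothetical cap_ndoms() value */

    static domid_t domid_map[NR_DIDS];   /* DID -> owning domid            */

    /* Works identically for real and pseudo domain IDs. */
    static int did_from_domid(domid_t domid)
    {
        for ( unsigned int i = 0; i < NR_DIDS; ++i )
            if ( domid_map[i] == domid )
                return i;
        return -1;                       /* no DID allocated for this domid */
    }

    int main(void)
    {
        domid_map[3] = 0x8005;           /* a pseudo ID from the upper half */
        return did_from_domid(0x8005) == 3 ? 0 : 1;
    }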

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -59,8 +59,8 @@ static struct tasklet vtd_fault_tasklet;
 static int setup_hwdom_device(u8 devfn, struct pci_dev *);
 static void setup_hwdom_rmrr(struct domain *d);
 
-static int domain_iommu_domid(struct domain *d,
-                              struct vtd_iommu *iommu)
+static int get_iommu_did(domid_t domid, const struct vtd_iommu *iommu,
+                         bool warn)
 {
     unsigned long nr_dom, i;
 
@@ -68,16 +68,16 @@ static int domain_iommu_domid(struct dom
     i = find_first_bit(iommu->domid_bitmap, nr_dom);
     while ( i < nr_dom )
     {
-        if ( iommu->domid_map[i] == d->domain_id )
+        if ( iommu->domid_map[i] == domid )
             return i;
 
         i = find_next_bit(iommu->domid_bitmap, nr_dom, i+1);
     }
 
-    if ( !d->is_dying )
+    if ( warn )
         dprintk(XENLOG_ERR VTDPREFIX,
-                "Cannot get valid iommu %u domid: %pd\n",
-                iommu->index, d);
+                "No valid iommu %u domid for Dom%d\n",
+                iommu->index, domid);
 
     return -1;
 }
@@ -85,8 +85,7 @@ static int domain_iommu_domid(struct dom
 #define DID_FIELD_WIDTH 16
 #define DID_HIGH_OFFSET 8
 static int context_set_domain_id(struct context_entry *context,
-                                 struct domain *d,
-                                 struct vtd_iommu *iommu)
+                                 domid_t domid, struct vtd_iommu *iommu)
 {
     unsigned long nr_dom, i;
     int found = 0;
@@ -97,7 +96,7 @@ static int context_set_domain_id(struct
     i = find_first_bit(iommu->domid_bitmap, nr_dom);
     while ( i < nr_dom )
     {
-        if ( iommu->domid_map[i] == d->domain_id )
+        if ( iommu->domid_map[i] == domid )
         {
             found = 1;
             break;
@@ -113,7 +112,7 @@ static int context_set_domain_id(struct
             dprintk(XENLOG_ERR VTDPREFIX, "IOMMU: no free domain ids\n");
             return -EFAULT;
         }
-        iommu->domid_map[i] = d->domain_id;
+        iommu->domid_map[i] = domid;
     }
 
     set_bit(i, iommu->domid_bitmap);
@@ -122,9 +121,9 @@ static int context_set_domain_id(struct
     return 0;
 }
 
-static void cleanup_domid_map(struct domain *domain, struct vtd_iommu *iommu)
+static void cleanup_domid_map(domid_t domid, struct vtd_iommu *iommu)
 {
-    int iommu_domid = domain_iommu_domid(domain, iommu);
+    int iommu_domid = get_iommu_did(domid, iommu, false);
 
     if ( iommu_domid >= 0 )
     {
@@ -180,7 +179,7 @@ static void check_cleanup_domid_map(stru
     if ( !found )
     {
         clear_bit(iommu->index, &dom_iommu(d)->arch.vtd.iommu_bitmap);
-        cleanup_domid_map(d, iommu);
+        cleanup_domid_map(d->domain_id, iommu);
     }
 }
 
@@ -687,7 +686,7 @@ static int __must_check iommu_flush_iotl
             continue;
 
         flush_dev_iotlb = !!find_ats_dev_drhd(iommu);
-        iommu_domid= domain_iommu_domid(d, iommu);
+        iommu_domid = get_iommu_did(d->domain_id, iommu, !d->is_dying);
         if ( iommu_domid == -1 )
             continue;
 
@@ -1447,7 +1446,7 @@ int domain_context_mapping_one(
         spin_unlock(&hd->arch.mapping_lock);
     }
 
-    if ( context_set_domain_id(&lctxt, domain, iommu) )
+    if ( context_set_domain_id(&lctxt, domid, iommu) )
     {
     unlock:
         spin_unlock(&iommu->lock);
@@ -1768,7 +1767,7 @@ int domain_context_unmap_one(
     context_clear_entry(*context);
     iommu_sync_cache(context, sizeof(struct context_entry));
 
-    iommu_domid= domain_iommu_domid(domain, iommu);
+    iommu_domid = get_iommu_did(domid, iommu, !domain->is_dying);
     if ( iommu_domid == -1 )
     {
         spin_unlock(&iommu->lock);
@@ -1938,7 +1937,7 @@ static void iommu_domain_teardown(struct
     ASSERT(!hd->arch.vtd.pgd_maddr);
 
     for_each_drhd_unit ( drhd )
-        cleanup_domid_map(d, drhd->iommu);
+        cleanup_domid_map(d->domain_id, drhd->iommu);
 }
 
 static int __must_check intel_iommu_map_page(struct domain *d, dfn_t dfn,
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: maintain a per-device pseudo domain ID

In order to subsequently enable per-device quarantine page tables, we'll
need domain-ID-like identifiers to be inserted in the respective device
(AMD) or context (Intel) table entries alongside the per-device page
table root addresses.

Make use of the fact that "real" domain IDs occupy only half of the
value range coverable by domid_t; the other half is used for the new
per-device (pseudo) IDs.

Note that in VT-d's iommu_alloc() I didn't want to introduce new memory
leaks in case of error, but existing ones don't get plugged - that'll be
the subject of a later change.

The VT-d changes are slightly asymmetric, but this way we can avoid
assigning pseudo domain IDs to devices which would never be mapped,
while still avoiding the addition of a new parameter to
domain_context_unmap().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
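
A tiny stand-alone check of the ID-space split that iommu_alloc_domid()
(in the x86 IOMMU hunk below) relies on. It assumes DOMID_MASK is 0x7fff,
the largest value the BUILD_BUG_ON in iommu_init_domid() permits: a pseudo
ID is the bitmap index with the top bit of the 16-bit value set, so it can
never collide with a real domain ID.

    #include <assert.h>
    #include <stdint.h>

    #define DOMID_MASK 0x7fffU  /* assumption: real IDs use the low half */

    int main(void)
    {
        uint16_t idx = 5;                         /* bitmap bit position   */
        uint16_t pseudo = idx | (DOMID_MASK + 1); /* as in iommu_alloc_domid() */

        assert(pseudo == 0x8005);                 /* lands in the upper half   */
        assert(pseudo > DOMID_MASK);              /* distinct from any real ID */
        assert(DOMID_MASK * 2U < UINT16_MAX);     /* mirrors the BUILD_BUG_ON  */
        return 0;
    }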

--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -142,6 +142,10 @@ int pi_update_irte(const struct pi_desc
         iommu_vcall(ops, sync_cache, addr, size);       \
 })
 
+unsigned long *iommu_init_domid(void);
+domid_t iommu_alloc_domid(unsigned long *map);
+void iommu_free_domid(domid_t domid, unsigned long *map);
+
 int __must_check iommu_free_pgtables(struct domain *d);
 struct domain_iommu;
 struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd);
--- a/xen/include/asm-x86/pci.h
+++ b/xen/include/asm-x86/pci.h
@@ -15,6 +15,12 @@
 
 struct arch_pci_dev {
     vmask_t used_vectors;
+    /*
+     * These fields are (de)initialized under pcidevs-lock. Other uses of
+     * them don't race (de)initialization and hence don't strictly need any
+     * locking.
+     */
+    domid_t pseudo_domid;
 };
 
 int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -96,6 +96,7 @@ struct amd_iommu {
     struct ring_buffer cmd_buffer;
     struct ring_buffer event_log;
     struct ring_buffer ppr_log;
+    unsigned long *domid_map;
 
     int exclusion_enable;
     int exclusion_allow_all;
--- a/xen/drivers/passthrough/amd/iommu_detect.c
+++ b/xen/drivers/passthrough/amd/iommu_detect.c
@@ -180,6 +180,11 @@ int __init amd_iommu_detect_one_acpi(
     if ( rt )
         goto out;
 
+    iommu->domid_map = iommu_init_domid();
+    rt = -ENOMEM;
+    if ( !iommu->domid_map )
+        goto out;
+
     rt = pci_ro_device(iommu->seg, bus, PCI_DEVFN(dev, func));
     if ( rt )
         printk(XENLOG_ERR "Could not mark config space of %pp read-only (%d)\n",
@@ -190,7 +195,10 @@ int __init amd_iommu_detect_one_acpi(
 
  out:
     if ( rt )
+    {
+        xfree(iommu->domid_map);
         xfree(iommu);
+    }
 
     return rt;
 }
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -508,6 +508,8 @@ static int amd_iommu_add_device(u8 devfn
     struct amd_iommu *iommu;
     u16 bdf;
     struct ivrs_mappings *ivrs_mappings;
+    bool fresh_domid = false;
+    int ret;
 
     if ( !pdev->domain )
         return -EINVAL;
@@ -568,7 +570,22 @@ static int amd_iommu_add_device(u8 devfn
         spin_unlock_irqrestore(&iommu->lock, flags);
     }
 
-    return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
+    if ( iommu_quarantine && pdev->arch.pseudo_domid == DOMID_INVALID )
+    {
+        pdev->arch.pseudo_domid = iommu_alloc_domid(iommu->domid_map);
+        if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+            return -ENOSPC;
+        fresh_domid = true;
+    }
+
+    ret = amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
+    if ( ret && fresh_domid )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
+    return ret;
 }
 
 static int amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
@@ -591,6 +608,9 @@ static int amd_iommu_remove_device(u8 de
 
     amd_iommu_disable_domain_device(pdev->domain, iommu, devfn, pdev);
 
+    iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
+    pdev->arch.pseudo_domid = DOMID_INVALID;
+
     ivrs_mappings = get_ivrs_mappings(pdev->seg);
     bdf = PCI_BDF2(pdev->bus, devfn);
     if ( amd_iommu_perdev_intremap &&
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -327,6 +327,7 @@ static struct pci_dev *alloc_pdev(struct
     *((u8*) &pdev->bus) = bus;
     *((u8*) &pdev->devfn) = devfn;
     pdev->domain = NULL;
+    pdev->arch.pseudo_domid = DOMID_INVALID;
     INIT_LIST_HEAD(&pdev->msi_list);
 
     pos = pci_find_cap_offset(pseg->nr, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
@@ -1276,8 +1277,12 @@ static int _dump_pci_devices(struct pci_
 
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
     {
-        printk("%pp - %pd - node %-3d - MSIs < ",
-               &pdev->sbdf, pdev->domain,
+        printk("%pp - ", &pdev->sbdf);
+        if ( pdev->domain == dom_io )
+            printk("DomIO:%x", pdev->arch.pseudo_domid);
+        else
+            printk("%pd", pdev->domain);
+        printk(" - node %-3d - MSIs < ",
                (pdev->node != NUMA_NO_NODE) ? pdev->node : -1);
         list_for_each_entry ( msi, &pdev->msi_list, list )
                printk("%d ", msi->irq);
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -22,6 +22,7 @@
 #include <xen/sched.h>
 #include <xen/xmalloc.h>
 #include <xen/domain_page.h>
+#include <xen/err.h>
 #include <xen/iocap.h>
 #include <xen/iommu.h>
 #include <xen/numa.h>
@@ -1204,7 +1205,7 @@ int __init iommu_alloc(struct acpi_drhd_
 {
     struct vtd_iommu *iommu;
     unsigned long sagaw, nr_dom;
-    int agaw;
+    int agaw, rc;
 
     if ( nr_iommus >= MAX_IOMMUS )
     {
@@ -1297,7 +1298,16 @@ int __init iommu_alloc(struct acpi_drhd_
     if ( !iommu->domid_map )
         return -ENOMEM;
 
+    iommu->pseudo_domid_map = iommu_init_domid();
+    rc = -ENOMEM;
+    if ( !iommu->pseudo_domid_map )
+        goto free;
+
     return 0;
+
+ free:
+    iommu_free(drhd);
+    return rc;
 }
 
 void __init iommu_free(struct acpi_drhd_unit *drhd)
@@ -1320,6 +1330,7 @@ void __init iommu_free(struct acpi_drhd_
 
     xfree(iommu->domid_bitmap);
     xfree(iommu->domid_map);
+    xfree(iommu->pseudo_domid_map);
 
     if ( iommu->msi.irq >= 0 )
         destroy_irq(iommu->msi.irq);
@@ -1581,8 +1592,8 @@ int domain_context_mapping_one(
     return rc ?: pdev && prev_dom;
 }
 
-static int domain_context_unmap(struct domain *d, uint8_t devfn,
-                                struct pci_dev *pdev);
+static const struct acpi_drhd_unit *domain_context_unmap(
+    struct domain *d, uint8_t devfn, struct pci_dev *pdev);
 
 static int domain_context_mapping(struct domain *domain, u8 devfn,
                                   struct pci_dev *pdev)
@@ -1590,6 +1601,7 @@ static int domain_context_mapping(struct
     struct acpi_drhd_unit *drhd;
     const struct acpi_rmrr_unit *rmrr;
     paddr_t pgd_maddr = dom_iommu(domain)->arch.vtd.pgd_maddr;
+    domid_t orig_domid = pdev->arch.pseudo_domid;
     int ret = 0;
     unsigned int i, mode = 0;
     uint16_t seg = pdev->seg, bdf;
@@ -1649,6 +1661,14 @@ static int domain_context_mapping(struct
         break;
 
     case DEV_TYPE_PCIe_ENDPOINT:
+        if ( iommu_quarantine && orig_domid == DOMID_INVALID )
+        {
+            pdev->arch.pseudo_domid =
+                iommu_alloc_domid(drhd->iommu->pseudo_domid_map);
+            if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+                return -ENOSPC;
+        }
+
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCIe: map %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
@@ -1663,6 +1683,14 @@ static int domain_context_mapping(struct
         break;
 
     case DEV_TYPE_PCI:
+        if ( iommu_quarantine && orig_domid == DOMID_INVALID )
+        {
+            pdev->arch.pseudo_domid =
+                iommu_alloc_domid(drhd->iommu->pseudo_domid_map);
+            if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+                return -ENOSPC;
+        }
+
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCI: map %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
@@ -1736,6 +1764,13 @@ static int domain_context_mapping(struct
     if ( !ret && devfn == pdev->devfn )
         pci_vtd_quirk(pdev);
 
+    if ( ret && drhd && orig_domid == DOMID_INVALID )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid,
+                         drhd->iommu->pseudo_domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
     return ret;
 }
 
@@ -1818,8 +1853,10 @@ int domain_context_unmap_one(
     return rc;
 }
 
-static int domain_context_unmap(struct domain *domain, u8 devfn,
-                                struct pci_dev *pdev)
+static const struct acpi_drhd_unit *domain_context_unmap(
+    struct domain *domain,
+    uint8_t devfn,
+    struct pci_dev *pdev)
 {
     struct acpi_drhd_unit *drhd;
     struct vtd_iommu *iommu;
@@ -1829,7 +1866,7 @@ static int domain_context_unmap(struct d
 
     drhd = acpi_find_matched_drhd_unit(pdev);
     if ( !drhd )
-        return -ENODEV;
+        return ERR_PTR(-ENODEV);
     iommu = drhd->iommu;
 
     switch ( pdev->type )
@@ -1839,7 +1876,7 @@ static int domain_context_unmap(struct d
             printk(VTDPREFIX "%pd:Hostbridge: skip %pp unmap\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
         if ( !is_hardware_domain(domain) )
-            return -EPERM;
+            return ERR_PTR(-EPERM);
         goto out;
 
     case DEV_TYPE_PCIe_BRIDGE:
@@ -1912,7 +1949,7 @@ static int domain_context_unmap(struct d
         check_cleanup_domid_map(domain, pdev, iommu);
 
 out:
-    return ret;
+    return ret ? ERR_PTR(ret) : drhd;
 }
 
 static void iommu_clear_root_pgtable(struct domain *d)
@@ -2137,16 +2174,17 @@ static int intel_iommu_enable_device(str
 
 static int intel_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
 {
+    const struct acpi_drhd_unit *drhd;
     struct acpi_rmrr_unit *rmrr;
     u16 bdf;
-    int ret, i;
+    unsigned int i;
 
     if ( !pdev->domain )
         return -EINVAL;
 
-    ret = domain_context_unmap(pdev->domain, devfn, pdev);
-    if ( ret )
-        return ret;
+    drhd = domain_context_unmap(pdev->domain, devfn, pdev);
+    if ( IS_ERR(drhd) )
+        return PTR_ERR(drhd);
 
     for_each_rmrr_device ( rmrr, bdf, i )
     {
@@ -2163,6 +2201,13 @@ static int intel_iommu_remove_device(u8
                                rmrr->end_address, 0);
     }
 
+    if ( drhd )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid,
+                         drhd->iommu->pseudo_domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
     return 0;
 }
 
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -535,6 +535,7 @@ struct vtd_iommu {
     } flush;
 
     struct list_head ats_devices;
+    unsigned long *pseudo_domid_map; /* "pseudo" domain id bitmap */
     unsigned long *domid_bitmap;  /* domain id bitmap */
     u16 *domid_map;               /* domain id mapping array */
     uint32_t version;
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -386,6 +386,53 @@ void __hwdom_init arch_iommu_hwdom_init(
         return;
 }
 
+unsigned long *__init iommu_init_domid(void)
+{
+    if ( !iommu_quarantine )
+        return ZERO_BLOCK_PTR;
+
+    BUILD_BUG_ON(DOMID_MASK * 2U >= UINT16_MAX);
+
+    return xzalloc_array(unsigned long,
+                         BITS_TO_LONGS(UINT16_MAX - DOMID_MASK));
+}
+
+domid_t iommu_alloc_domid(unsigned long *map)
+{
+    /*
+     * This is used uniformly across all IOMMUs, such that on typical
+     * systems we wouldn't re-use the same ID very quickly (perhaps never).
+     */
+    static unsigned int start;
+    unsigned int idx = find_next_zero_bit(map, UINT16_MAX - DOMID_MASK, start);
+
+    ASSERT(pcidevs_locked());
+
+    if ( idx >= UINT16_MAX - DOMID_MASK )
+        idx = find_first_zero_bit(map, UINT16_MAX - DOMID_MASK);
+    if ( idx >= UINT16_MAX - DOMID_MASK )
+        return DOMID_INVALID;
+
+    __set_bit(idx, map);
+
+    start = idx + 1;
+
+    return idx | (DOMID_MASK + 1);
+}
+
+void iommu_free_domid(domid_t domid, unsigned long *map)
+{
+    ASSERT(pcidevs_locked());
+
+    if ( domid == DOMID_INVALID )
+        return;
+
+    ASSERT(domid > DOMID_MASK);
+
+    if ( !__test_and_clear_bit(domid & DOMID_MASK, map) )
+        BUG();
+}
+
 int iommu_free_pgtables(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: drop TLB flushes from quarantine_init() hooks

The page tables just created aren't hooked up yet anywhere, so there's
nothing that could be present in any TLB, and hence nothing to flush.
Dropping this flush is, at least on the VT-d side, a prerequisite for
per-device domain ID use when quarantining devices, as dom_io isn't
going to be assigned a DID anymore: the warning in get_iommu_did()
would otherwise trigger.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -584,8 +584,6 @@ int __init amd_iommu_quarantine_init(str
  out:
     spin_unlock(&hd->arch.mapping_lock);
 
-    amd_iommu_flush_all_pages(d);
-
     /* Pages leaked in failure case */
     return level ? -ENOMEM : 0;
 }
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2958,9 +2958,6 @@ static int __init intel_iommu_quarantine
  out:
     spin_unlock(&hd->arch.mapping_lock);
 
-    if ( !rc )
-        rc = iommu_flush_iotlb_all(d);
-
     /* Pages may be leaked in failure case */
     return rc;
 }
From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: abstract maximum number of page table levels

We will want to use the constant elsewhere.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -358,7 +358,7 @@ static inline int amd_iommu_get_paging_m
     while ( max_frames > PTE_PER_TABLE_SIZE )
     {
         max_frames = PTE_PER_TABLE_ALIGN(max_frames) >> PTE_PER_TABLE_SHIFT;
-        if ( ++level > 6 )
+        if ( ++level > IOMMU_MAX_PT_LEVELS )
             return -ENOMEM;
     }
 
--- a/xen/drivers/passthrough/amd/iommu-defs.h
+++ b/xen/drivers/passthrough/amd/iommu-defs.h
@@ -106,6 +106,7 @@ struct amd_iommu_dte {
     bool tv:1;
     unsigned int :5;
     unsigned int had:2;
+#define IOMMU_MAX_PT_LEVELS 6
     unsigned int paging_mode:3;
     uint64_t pt_root:40;
     bool ppr:1;
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -250,7 +250,7 @@ static int iommu_pde_from_dfn(struct dom
     table = hd->arch.amd.root_table;
     level = hd->arch.amd.paging_mode;
 
-    BUG_ON( table == NULL || level < 1 || level > 6 );
+    BUG_ON( table == NULL || level < 1 || level > IOMMU_MAX_PT_LEVELS );
 
     /*
      * A frame number past what the current page tables can represent can't
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: use per-device page tables for quarantining

Since it is unspecified how/when the memory regions covered by RMRRs /
unity maps may be accessed, devices with such regions may not be left
disconnected from the mappings of these regions (as long as it's not
certain that the device has been fully quiesced). Hence even the page
tables used when quarantining such devices need to contain mappings of
those regions. This implies installing page tables in the first place
even when not in scratch-page quarantining mode.

This is CVE-2022-26361 / part of XSA-400.

While for the purpose here it would be sufficient to have devices with
RMRRs / unity mapped regions use per-device page tables, extend this to
all devices (in scratch-page quarantining mode). This allows the leaf
pages to be mapped r/w, thus also covering memory writes (rather than
just reads) issued by non-quiescent devices.

Set up quarantine page tables as late as possible, yet early enough to
not encounter failure during de-assign. This means setup generally
happens in assign_device(), while (for now) the one in deassign_device()
is there mainly to be on the safe side.

In VT-d's DID allocation function don't require the IOMMU lock to be
held anymore: All involved code paths hold pcidevs_lock, so this way we
avoid the need to acquire the IOMMU lock around the new call to
context_set_domain_id().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
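
For orientation before the diff, a heavily simplified model of the
page-table shape that fill_qpt() (in the AMD hunk below) produces: every
not-yet-present slot at each level is pointed at a shared per-level sink,
and the lowest level maps a single writable scratch page, so stray DMA
from a not-fully-quiesced device lands on harmless memory while unity-map
entries installed beforehand are left alone. The two-level layout, table
size and frame numbers are toy stand-ins, not Xen's.

    #include <stdio.h>

    #define SLOTS 4                      /* toy; real tables have 512 entries */

    static long l0_sink[SLOTS];          /* shared lowest-level sink table    */
    static long scratch_frame = 0x1000;  /* the single writable scratch page  */

    static void fill_l0(long *l0)
    {
        for ( int i = 0; i < SLOTS; ++i )
            if ( !l0[i] )                /* keep unity-map entries intact     */
                l0[i] = scratch_frame;   /* r/w mapping of the scratch page   */
    }

    static void fill_l1(long *l1[SLOTS])
    {
        fill_l0(l0_sink);                /* populate the shared sink once     */
        for ( int i = 0; i < SLOTS; ++i )
        {
            if ( !l1[i] )
                l1[i] = l0_sink;         /* empty slot -> shared sink         */
            else
                fill_l0(l1[i]);          /* plug gaps around unity mappings   */
        }
    }

    int main(void)
    {
        long unity_l0[SLOTS] = { 0, 0x2000, 0, 0 }; /* pre-existing unity map */
        long *l1[SLOTS] = { NULL, NULL, unity_l0, NULL };

        fill_l1(l1);
        printf("unity entry kept: %#lx, gap now mapped: %#lx\n",
               unity_l0[1], unity_l0[0]);
        return 0;
    }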

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1468,7 +1468,7 @@ int set_identity_p2m_entry(struct domain
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
     int ret;
 
-    if ( !paging_mode_translate(p2m->domain) )
+    if ( !paging_mode_translate(d) )
     {
         if ( !is_iommu_enabled(d) )
             return 0;
--- a/xen/include/asm-x86/pci.h
+++ b/xen/include/asm-x86/pci.h
@@ -1,6 +1,8 @@
 #ifndef __X86_PCI_H__
 #define __X86_PCI_H__
 
+#include <xen/mm.h>
+
 #define CF8_BDF(cf8)     (  ((cf8) & 0x00ffff00) >> 8)
 #define CF8_ADDR_LO(cf8) (   (cf8) & 0x000000fc)
 #define CF8_ADDR_HI(cf8) (  ((cf8) & 0x0f000000) >> 16)
@@ -20,7 +22,18 @@ struct arch_pci_dev {
      * them don't race (de)initialization and hence don't strictly need any
      * locking.
      */
+    union {
+        /* Subset of struct arch_iommu's fields, to be used in dom_io. */
+        struct {
+            uint64_t pgd_maddr;
+        } vtd;
+        struct {
+            struct page_info *root_table;
+        } amd;
+    };
     domid_t pseudo_domid;
+    mfn_t leaf_mfn;
+    struct page_list_head pgtables_list;
 };
 
 int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -223,7 +223,8 @@ int amd_iommu_init_late(void);
 int amd_iommu_update_ivrs_mapping_acpi(void);
 int iov_adjust_irq_affinities(void);
 
-int amd_iommu_quarantine_init(struct domain *d);
+int amd_iommu_quarantine_init(struct pci_dev *pdev);
+void amd_iommu_quarantine_teardown(struct pci_dev *pdev);
 
 /* mapping functions */
 int __must_check amd_iommu_map_page(struct domain *d, dfn_t dfn,
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -528,64 +528,135 @@ int amd_iommu_reserve_domain_unity_unmap
     return rc;
 }
 
-int __init amd_iommu_quarantine_init(struct domain *d)
+static int fill_qpt(union amd_iommu_pte *this, unsigned int level,
+                    struct page_info *pgs[IOMMU_MAX_PT_LEVELS])
 {
-    struct domain_iommu *hd = dom_iommu(d);
+    struct domain_iommu *hd = dom_iommu(dom_io);
+    unsigned int i;
+    int rc = 0;
+
+    for ( i = 0; !rc && i < PTE_PER_TABLE_SIZE; ++i )
+    {
+        union amd_iommu_pte *pte = &this[i], *next;
+
+        if ( !pte->pr )
+        {
+            if ( !pgs[level] )
+            {
+                /*
+                 * The pgtable allocator is fine for the leaf page, as well as
+                 * page table pages, and the resulting allocations are always
+                 * zeroed.
+                 */
+                pgs[level] = iommu_alloc_pgtable(hd);
+                if ( !pgs[level] )
+                {
+                    rc = -ENOMEM;
+                    break;
+                }
+
+                if ( level )
+                {
+                    next = __map_domain_page(pgs[level]);
+                    rc = fill_qpt(next, level - 1, pgs);
+                    unmap_domain_page(next);
+                }
+            }
+
+            /*
+             * PDEs are essentially a subset of PTEs, so this function
+             * is fine to use even at the leaf.
+             */
+            set_iommu_pde_present(pte, mfn_x(page_to_mfn(pgs[level])), level,
+                                  true, true);
+        }
+        else if ( level && pte->next_level )
+        {
+            next = map_domain_page(_mfn(pte->mfn));
+            rc = fill_qpt(next, level - 1, pgs);
+            unmap_domain_page(next);
+        }
+    }
+
+    return rc;
+}
+
+int amd_iommu_quarantine_init(struct pci_dev *pdev)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
     unsigned long end_gfn =
         1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT);
     unsigned int level = amd_iommu_get_paging_mode(end_gfn);
-    union amd_iommu_pte *table;
+    unsigned int req_id = get_dma_requestor_id(pdev->seg, pdev->sbdf.bdf);
+    const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
+    int rc;
 
-    if ( hd->arch.amd.root_table )
+    ASSERT(pcidevs_locked());
+    ASSERT(!hd->arch.amd.root_table);
+    ASSERT(page_list_empty(&hd->arch.pgtables.list));
+
+    ASSERT(pdev->arch.pseudo_domid != DOMID_INVALID);
+
+    if ( pdev->arch.amd.root_table )
     {
-        ASSERT_UNREACHABLE();
+        clear_domain_page(pdev->arch.leaf_mfn);
         return 0;
     }
 
-    spin_lock(&hd->arch.mapping_lock);
-
-    hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
-    if ( !hd->arch.amd.root_table )
-        goto out;
-
-    table = __map_domain_page(hd->arch.amd.root_table);
-    while ( level )
+    pdev->arch.amd.root_table = iommu_alloc_pgtable(hd);
+    if ( !pdev->arch.amd.root_table )
+        return -ENOMEM;
+
+    /* Transiently install the root into DomIO, for iommu_identity_mapping(). */
+    hd->arch.amd.root_table = pdev->arch.amd.root_table;
+
+    rc = amd_iommu_reserve_domain_unity_map(dom_io,
+                                            ivrs_mappings[req_id].unity_map,
+                                            0);
+
+    iommu_identity_map_teardown(dom_io);
+    hd->arch.amd.root_table = NULL;
+
+    if ( rc )
+        printk("%pp: quarantine unity mapping failed\n", &pdev->sbdf);
+    else
     {
-        struct page_info *pg;
-        unsigned int i;
+        union amd_iommu_pte *root;
+        struct page_info *pgs[IOMMU_MAX_PT_LEVELS] = {};
 
-        /*
-         * The pgtable allocator is fine for the leaf page, as well as
-         * page table pages, and the resulting allocations are always
-         * zeroed.
-         */
-        pg = iommu_alloc_pgtable(hd);
-        if ( !pg )
-            break;
+        spin_lock(&hd->arch.mapping_lock);
 
-        for ( i = 0; i < PTE_PER_TABLE_SIZE; i++ )
-        {
-            union amd_iommu_pte *pde = &table[i];
+        root = __map_domain_page(pdev->arch.amd.root_table);
+        rc = fill_qpt(root, level - 1, pgs);
+        unmap_domain_page(root);
 
-            /*
-             * PDEs are essentially a subset of PTEs, so this function
-             * is fine to use even at the leaf.
-             */
-            set_iommu_pde_present(pde, mfn_x(page_to_mfn(pg)), level - 1,
-                                  false, true);
-        }
+        pdev->arch.leaf_mfn = page_to_mfn(pgs[0]);
 
-        unmap_domain_page(table);
-        table = __map_domain_page(pg);
-        level--;
+        spin_unlock(&hd->arch.mapping_lock);
     }
-    unmap_domain_page(table);
 
- out:
-    spin_unlock(&hd->arch.mapping_lock);
+    page_list_move(&pdev->arch.pgtables_list, &hd->arch.pgtables.list);
+
+    if ( rc )
+        amd_iommu_quarantine_teardown(pdev);
+
+    return rc;
+}
+
+void amd_iommu_quarantine_teardown(struct pci_dev *pdev)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
+
+    ASSERT(pcidevs_locked());
+
+    if ( !pdev->arch.amd.root_table )
+        return;
 
-    /* Pages leaked in failure case */
-    return level ? -ENOMEM : 0;
+    ASSERT(page_list_empty(&hd->arch.pgtables.list));
+    page_list_move(&hd->arch.pgtables.list, &pdev->arch.pgtables_list);
+    while ( iommu_free_pgtables(dom_io) == -ERESTART )
+        /* nothing */;
+    pdev->arch.amd.root_table = NULL;
 }
 
 /*
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -122,6 +122,8 @@ static int __must_check amd_iommu_setup_
     u8 bus = pdev->bus;
     const struct domain_iommu *hd = dom_iommu(domain);
     const struct ivrs_mappings *ivrs_dev;
+    const struct page_info *root_pg;
+    domid_t domid;
 
     BUG_ON(!hd->arch.amd.paging_mode || !iommu->dev_table.buffer);
 
@@ -141,14 +143,25 @@ static int __must_check amd_iommu_setup_
     dte = &table[req_id];
     ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
 
+    if ( domain != dom_io )
+    {
+        root_pg = hd->arch.amd.root_table;
+        domid = domain->domain_id;
+    }
+    else
+    {
+        root_pg = pdev->arch.amd.root_table;
+        domid = pdev->arch.pseudo_domid;
+    }
+
     spin_lock_irqsave(&iommu->lock, flags);
 
     if ( !dte->v || !dte->tv )
     {
         /* bind DTE to domain page-tables */
         rc = amd_iommu_set_root_page_table(
-                 dte, page_to_maddr(hd->arch.amd.root_table),
-                 domain->domain_id, hd->arch.amd.paging_mode, sr_flags);
+                 dte, page_to_maddr(root_pg), domid,
+                 hd->arch.amd.paging_mode, sr_flags);
         if ( rc )
         {
             ASSERT(rc < 0);
@@ -172,7 +185,7 @@ static int __must_check amd_iommu_setup_
 
         amd_iommu_flush_device(iommu, req_id);
     }
-    else if ( dte->pt_root != mfn_x(page_to_mfn(hd->arch.amd.root_table)) )
+    else if ( dte->pt_root != mfn_x(page_to_mfn(root_pg)) )
     {
         /*
          * Strictly speaking if the device is the only one with this requestor
@@ -185,8 +198,8 @@ static int __must_check amd_iommu_setup_
             rc = -EOPNOTSUPP;
         else
             rc = amd_iommu_set_root_page_table(
-                     dte, page_to_maddr(hd->arch.amd.root_table),
-                     domain->domain_id, hd->arch.amd.paging_mode, sr_flags);
+                     dte, page_to_maddr(root_pg), domid,
+                     hd->arch.amd.paging_mode, sr_flags);
         if ( rc < 0 )
         {
             spin_unlock_irqrestore(&iommu->lock, flags);
@@ -205,6 +218,7 @@ static int __must_check amd_iommu_setup_
               * intended anyway.
               */
              !pdev->domain->is_dying &&
+             pdev->domain != dom_io &&
              (any_pdev_behind_iommu(pdev->domain, pdev, iommu) ||
               pdev->phantom_stride) )
             printk(" %pp: reassignment may cause %pd data corruption\n",
@@ -234,9 +248,8 @@ static int __must_check amd_iommu_setup_
     AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
                     "root table = %#"PRIx64", "
                     "domain = %d, paging mode = %d\n",
-                    req_id, pdev->type,
-                    page_to_maddr(hd->arch.amd.root_table),
-                    domain->domain_id, hd->arch.amd.paging_mode);
+                    req_id, pdev->type, page_to_maddr(root_pg),
+                    domid, hd->arch.amd.paging_mode);
 
     ASSERT(pcidevs_locked());
 
@@ -305,7 +318,7 @@ int amd_iommu_alloc_root(struct domain *
 {
     struct domain_iommu *hd = dom_iommu(d);
 
-    if ( unlikely(!hd->arch.amd.root_table) )
+    if ( unlikely(!hd->arch.amd.root_table) && d != dom_io )
     {
         hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
         if ( !hd->arch.amd.root_table )
@@ -396,7 +409,7 @@ static void amd_iommu_disable_domain_dev
 
         AMD_IOMMU_DEBUG("Disable: device id = %#x, "
                         "domain = %d, paging mode = %d\n",
-                        req_id,  domain->domain_id,
+                        req_id, dte->domain_id,
                         dom_iommu(domain)->arch.amd.paging_mode);
     }
     spin_unlock_irqrestore(&iommu->lock, flags);
@@ -608,6 +621,8 @@ static int amd_iommu_remove_device(u8 de
 
     amd_iommu_disable_domain_device(pdev->domain, iommu, devfn, pdev);
 
+    amd_iommu_quarantine_teardown(pdev);
+
     iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
     pdev->arch.pseudo_domid = DOMID_INVALID;
 
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -424,21 +424,21 @@ int iommu_iotlb_flush_all(struct domain
     return rc;
 }
 
-static int __init iommu_quarantine_init(void)
+int iommu_quarantine_dev_init(device_t *dev)
 {
     const struct domain_iommu *hd = dom_iommu(dom_io);
-    int rc;
 
-    dom_io->options |= XEN_DOMCTL_CDF_iommu;
+    if ( !iommu_quarantine || !hd->platform_ops->quarantine_init )
+        return 0;
 
-    rc = iommu_domain_init(dom_io, 0);
-    if ( rc )
-        return rc;
+    return iommu_call(hd->platform_ops, quarantine_init, dev);
+}
 
-    if ( !hd->platform_ops->quarantine_init )
-        return 0;
+static int __init iommu_quarantine_init(void)
+{
+    dom_io->options |= XEN_DOMCTL_CDF_iommu;
 
-    return hd->platform_ops->quarantine_init(dom_io);
+    return iommu_domain_init(dom_io, 0);
 }
 
 int __init iommu_setup(void)
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -858,9 +858,16 @@ static int deassign_device(struct domain
         return -ENODEV;
 
     /* De-assignment from dom_io should de-quarantine the device */
-    target = ((pdev->quarantine || iommu_quarantine) &&
-              pdev->domain != dom_io) ?
-        dom_io : hardware_domain;
+    if ( (pdev->quarantine || iommu_quarantine) && pdev->domain != dom_io )
+    {
+        ret = iommu_quarantine_dev_init(pci_to_dev(pdev));
+        if ( ret )
+           return ret;
+
+        target = dom_io;
+    }
+    else
+        target = hardware_domain;
 
     while ( pdev->phantom_stride )
     {
@@ -1441,6 +1448,13 @@ static int assign_device(struct domain *
         msixtbl_init(d);
     }
 
+    if ( pdev->domain != dom_io )
+    {
+        rc = iommu_quarantine_dev_init(pci_to_dev(pdev));
+        if ( rc )
+            goto done;
+    }
+
     pdev->fault.count = 0;
 
     if ( (rc = hd->platform_ops->assign_device(d, devfn, pci_to_dev(pdev), flag)) )
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -43,6 +43,12 @@
 #include "vtd.h"
 #include "../ats.h"
 
+#define DEVICE_DOMID(d, pdev) ((d) != dom_io ? (d)->domain_id \
+                                             : (pdev)->arch.pseudo_domid)
+#define DEVICE_PGTABLE(d, pdev) ((d) != dom_io \
+                                 ? dom_iommu(d)->arch.vtd.pgd_maddr \
+                                 : (pdev)->arch.vtd.pgd_maddr)
+
 /* Possible unfiltered LAPIC/MSI messages from untrusted sources? */
 bool __read_mostly untrusted_msi;
 
@@ -85,13 +91,18 @@ static int get_iommu_did(domid_t domid,
 
 #define DID_FIELD_WIDTH 16
 #define DID_HIGH_OFFSET 8
+
+/*
+ * This function may have "context" passed as NULL, to merely obtain a DID
+ * for "domid".
+ */
 static int context_set_domain_id(struct context_entry *context,
                                  domid_t domid, struct vtd_iommu *iommu)
 {
     unsigned long nr_dom, i;
     int found = 0;
 
-    ASSERT(spin_is_locked(&iommu->lock));
+    ASSERT(pcidevs_locked());
 
     nr_dom = cap_ndoms(iommu->cap);
     i = find_first_bit(iommu->domid_bitmap, nr_dom);
@@ -117,8 +128,13 @@ static int context_set_domain_id(struct
     }
 
     set_bit(i, iommu->domid_bitmap);
-    context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
-    context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
+
+    if ( context )
+    {
+        context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
+        context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
+    }
+
     return 0;
 }
 
@@ -168,8 +184,12 @@ static void check_cleanup_domid_map(stru
                                     const struct pci_dev *exclude,
                                     struct vtd_iommu *iommu)
 {
-    bool found = any_pdev_behind_iommu(d, exclude, iommu);
+    bool found;
+
+    if ( d == dom_io )
+        return;
 
+    found = any_pdev_behind_iommu(d, exclude, iommu);
     /*
      * Hidden devices are associated with DomXEN but usable by the hardware
      * domain. Hence they need considering here as well.
@@ -1414,7 +1434,7 @@ int domain_context_mapping_one(
         domid = iommu->domid_map[prev_did];
         if ( domid < DOMID_FIRST_RESERVED )
             prev_dom = rcu_lock_domain_by_id(domid);
-        else if ( domid == DOMID_IO )
+        else if ( pdev ? domid == pdev->arch.pseudo_domid : domid > DOMID_MASK )
             prev_dom = rcu_lock_domain(dom_io);
         if ( !prev_dom )
         {
@@ -1570,15 +1590,12 @@ int domain_context_mapping_one(
     {
         if ( !prev_dom )
             ret = domain_context_unmap_one(domain, iommu, bus, devfn,
-                                           domain->domain_id);
+                                           DEVICE_DOMID(domain, pdev));
         else if ( prev_dom != domain ) /* Avoid infinite recursion. */
-        {
-            hd = dom_iommu(prev_dom);
             ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
-                                             domain->domain_id,
-                                             hd->arch.vtd.pgd_maddr,
+                                             DEVICE_DOMID(prev_dom, pdev),
+                                             DEVICE_PGTABLE(prev_dom, pdev),
                                              mode & MAP_WITH_RMRR) < 0;
-        }
         else
             ret = 1;
 
@@ -1600,7 +1617,7 @@ static int domain_context_mapping(struct
 {
     struct acpi_drhd_unit *drhd;
     const struct acpi_rmrr_unit *rmrr;
-    paddr_t pgd_maddr = dom_iommu(domain)->arch.vtd.pgd_maddr;
+    paddr_t pgd_maddr = DEVICE_PGTABLE(domain, pdev);
     domid_t orig_domid = pdev->arch.pseudo_domid;
     int ret = 0;
     unsigned int i, mode = 0;
@@ -1633,7 +1650,7 @@ static int domain_context_mapping(struct
         break;
     }
 
-    if ( domain != pdev->domain )
+    if ( domain != pdev->domain && pdev->domain != dom_io )
     {
         if ( pdev->domain->is_dying )
             mode |= MAP_OWNER_DYING;
@@ -1672,8 +1689,8 @@ static int domain_context_mapping(struct
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCIe: map %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
-        ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, domain->domain_id, pgd_maddr,
+        ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, pdev,
+                                         DEVICE_DOMID(domain, pdev), pgd_maddr,
                                          mode);
         if ( ret > 0 )
             ret = 0;
@@ -1696,8 +1713,8 @@ static int domain_context_mapping(struct
                    domain, &PCI_SBDF3(seg, bus, devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, domain->domain_id, pgd_maddr,
-                                         mode);
+                                         pdev, DEVICE_DOMID(domain, pdev),
+                                         pgd_maddr, mode);
         if ( ret < 0 )
             break;
         prev_present = ret;
@@ -1725,8 +1742,8 @@ static int domain_context_mapping(struct
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                             NULL, domain->domain_id, pgd_maddr,
-                                             mode);
+                                             NULL, DEVICE_DOMID(domain, pdev),
+                                             pgd_maddr, mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1741,8 +1758,8 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL, domain->domain_id, pgd_maddr,
-                                             mode);
+                                             NULL, DEVICE_DOMID(domain, pdev),
+                                             pgd_maddr, mode);
 
         if ( ret )
         {
@@ -1889,7 +1906,7 @@ static const struct acpi_drhd_unit *doma
             printk(VTDPREFIX "%pd:PCIe: unmap %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
         ret = domain_context_unmap_one(domain, iommu, bus, devfn,
-                                       domain->domain_id);
+                                       DEVICE_DOMID(domain, pdev));
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             disable_ats_device(pdev);
 
@@ -1900,7 +1917,7 @@ static const struct acpi_drhd_unit *doma
             printk(VTDPREFIX "%pd:PCI: unmap %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
         ret = domain_context_unmap_one(domain, iommu, bus, devfn,
-                                       domain->domain_id);
+                                       DEVICE_DOMID(domain, pdev));
         if ( ret )
             break;
 
@@ -1923,18 +1940,12 @@ static const struct acpi_drhd_unit *doma
             break;
         }
 
+        ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                       DEVICE_DOMID(domain, pdev));
         /* PCIe to PCI/PCIx bridge */
-        if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
-        {
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
-                                           domain->domain_id);
-            if ( !ret )
-                ret = domain_context_unmap_one(domain, iommu, secbus, 0,
-                                               domain->domain_id);
-        }
-        else /* Legacy PCI bridge */
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
-                                           domain->domain_id);
+        if ( !ret && pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
+            ret = domain_context_unmap_one(domain, iommu, secbus, 0,
+                                           DEVICE_DOMID(domain, pdev));
 
         break;
 
@@ -1977,6 +1988,26 @@ static void iommu_domain_teardown(struct
         cleanup_domid_map(d->domain_id, drhd->iommu);
 }
 
+static void quarantine_teardown(struct pci_dev *pdev,
+                                const struct acpi_drhd_unit *drhd)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
+
+    ASSERT(pcidevs_locked());
+
+    if ( !pdev->arch.vtd.pgd_maddr )
+        return;
+
+    ASSERT(page_list_empty(&hd->arch.pgtables.list));
+    page_list_move(&hd->arch.pgtables.list, &pdev->arch.pgtables_list);
+    while ( iommu_free_pgtables(dom_io) == -ERESTART )
+        /* nothing */;
+    pdev->arch.vtd.pgd_maddr = 0;
+
+    if ( drhd )
+        cleanup_domid_map(pdev->arch.pseudo_domid, drhd->iommu);
+}
+
 static int __must_check intel_iommu_map_page(struct domain *d, dfn_t dfn,
                                              mfn_t mfn, unsigned int flags,
                                              unsigned int *flush_flags)
@@ -2201,6 +2232,8 @@ static int intel_iommu_remove_device(u8
                                rmrr->end_address, 0);
     }
 
+    quarantine_teardown(pdev, drhd);
+
     if ( drhd )
     {
         iommu_free_domid(pdev->arch.pseudo_domid,
@@ -2896,69 +2929,135 @@ static void vtd_dump_page_tables(struct
                               agaw_to_level(hd->arch.vtd.agaw), 0, 0);
 }
 
-static int __init intel_iommu_quarantine_init(struct domain *d)
+static int fill_qpt(struct dma_pte *this, unsigned int level,
+                    struct page_info *pgs[6])
 {
-    struct domain_iommu *hd = dom_iommu(d);
+    struct domain_iommu *hd = dom_iommu(dom_io);
+    unsigned int i;
+    int rc = 0;
+
+    for ( i = 0; !rc && i < PTE_NUM; ++i )
+    {
+        struct dma_pte *pte = &this[i], *next;
+
+        if ( !dma_pte_present(*pte) )
+        {
+            if ( !pgs[level] )
+            {
+                /*
+                 * The pgtable allocator is fine for the leaf page, as well as
+                 * page table pages, and the resulting allocations are always
+                 * zeroed.
+                 */
+                pgs[level] = iommu_alloc_pgtable(hd);
+                if ( !pgs[level] )
+                {
+                    rc = -ENOMEM;
+                    break;
+                }
+
+                if ( level )
+                {
+                    next = map_vtd_domain_page(page_to_maddr(pgs[level]));
+                    rc = fill_qpt(next, level - 1, pgs);
+                    unmap_vtd_domain_page(next);
+                }
+            }
+
+            dma_set_pte_addr(*pte, page_to_maddr(pgs[level]));
+            dma_set_pte_readable(*pte);
+            dma_set_pte_writable(*pte);
+        }
+        else if ( level && !dma_pte_superpage(*pte) )
+        {
+            next = map_vtd_domain_page(dma_pte_addr(*pte));
+            rc = fill_qpt(next, level - 1, pgs);
+            unmap_vtd_domain_page(next);
+        }
+    }
+
+    return rc;
+}
+
+static int intel_iommu_quarantine_init(struct pci_dev *pdev)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
     struct page_info *pg;
-    struct dma_pte *parent;
     unsigned int agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
     unsigned int level = agaw_to_level(agaw);
-    int rc = 0;
+    const struct acpi_drhd_unit *drhd;
+    const struct acpi_rmrr_unit *rmrr;
+    unsigned int i, bdf;
+    bool rmrr_found = false;
+    int rc;
 
-    spin_lock(&hd->arch.mapping_lock);
+    ASSERT(pcidevs_locked());
+    ASSERT(!hd->arch.vtd.pgd_maddr);
+    ASSERT(page_list_empty(&hd->arch.pgtables.list));
 
-    if ( hd->arch.vtd.pgd_maddr )
+    if ( pdev->arch.vtd.pgd_maddr )
     {
-        ASSERT_UNREACHABLE();
-        goto out;
+        clear_domain_page(pdev->arch.leaf_mfn);
+        return 0;
     }
 
-    pg = iommu_alloc_pgtable(hd);
+    drhd = acpi_find_matched_drhd_unit(pdev);
+    if ( !drhd )
+        return -ENODEV;
 
-    rc = -ENOMEM;
+    pg = iommu_alloc_pgtable(hd);
     if ( !pg )
-        goto out;
+        return -ENOMEM;
 
+    rc = context_set_domain_id(NULL, pdev->arch.pseudo_domid, drhd->iommu);
+
+    /* Transiently install the root into DomIO, for iommu_identity_mapping(). */
     hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
 
-    parent = map_vtd_domain_page(hd->arch.vtd.pgd_maddr);
-    while ( level )
+    for_each_rmrr_device ( rmrr, bdf, i )
     {
-        uint64_t maddr;
-        unsigned int offset;
-
-        /*
-         * The pgtable allocator is fine for the leaf page, as well as
-         * page table pages, and the resulting allocations are always
-         * zeroed.
-         */
-        pg = iommu_alloc_pgtable(hd);
-
-        if ( !pg )
-            goto out;
+        if ( rc )
+            break;
 
-        maddr = page_to_maddr(pg);
-        for ( offset = 0; offset < PTE_NUM; offset++ )
+        if ( rmrr->segment == pdev->seg && bdf == pdev->sbdf.bdf )
         {
-            struct dma_pte *pte = &parent[offset];
+            rmrr_found = true;
 
-            dma_set_pte_addr(*pte, maddr);
-            dma_set_pte_readable(*pte);
+            rc = iommu_identity_mapping(dom_io, p2m_access_rw,
+                                        rmrr->base_address, rmrr->end_address,
+                                        0);
+            if ( rc )
+                printk(XENLOG_ERR VTDPREFIX
+                       "%pp: RMRR quarantine mapping failed\n",
+                       &pdev->sbdf);
         }
-        iommu_sync_cache(parent, PAGE_SIZE);
+    }
 
-        unmap_vtd_domain_page(parent);
-        parent = map_vtd_domain_page(maddr);
-        level--;
+    iommu_identity_map_teardown(dom_io);
+    hd->arch.vtd.pgd_maddr = 0;
+    pdev->arch.vtd.pgd_maddr = page_to_maddr(pg);
+
+    if ( !rc )
+    {
+        struct dma_pte *root;
+        struct page_info *pgs[6] = {};
+
+        spin_lock(&hd->arch.mapping_lock);
+
+        root = map_vtd_domain_page(pdev->arch.vtd.pgd_maddr);
+        rc = fill_qpt(root, level - 1, pgs);
+        unmap_vtd_domain_page(root);
+
+        pdev->arch.leaf_mfn = page_to_mfn(pgs[0]);
+
+        spin_unlock(&hd->arch.mapping_lock);
     }
-    unmap_vtd_domain_page(parent);
 
-    rc = 0;
+    page_list_move(&pdev->arch.pgtables_list, &hd->arch.pgtables.list);
 
- out:
-    spin_unlock(&hd->arch.mapping_lock);
+    if ( rc )
+        quarantine_teardown(pdev, drhd);
 
-    /* Pages may be leaked in failure case */
     return rc;
 }
 
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -509,7 +509,7 @@ struct vtd_iommu {
     u32 nr_pt_levels;
     u64	cap;
     u64	ecap;
-    spinlock_t lock; /* protect context, domain ids */
+    spinlock_t lock; /* protect context */
     spinlock_t register_lock; /* protect iommu register handling */
     u64 root_maddr; /* root entry machine address */
     nodeid_t node;
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -234,7 +234,7 @@ typedef int iommu_grdm_t(xen_pfn_t start
 struct iommu_ops {
     int (*init)(struct domain *d);
     void (*hwdom_init)(struct domain *d);
-    int (*quarantine_init)(struct domain *d);
+    int (*quarantine_init)(device_t *dev);
     int (*add_device)(u8 devfn, device_t *dev);
     int (*enable_device)(device_t *dev);
     int (*remove_device)(u8 devfn, device_t *dev);
@@ -352,6 +352,7 @@ int __must_check iommu_suspend(void);
 void iommu_resume(void);
 void iommu_crash_shutdown(void);
 int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
+int iommu_quarantine_dev_init(device_t *dev);
 
 #ifdef CONFIG_HAS_PCI
 int iommu_do_pci_domctl(struct xen_domctl *, struct domain *d,
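
For readers unfamiliar with the scratch-page layout built by fill_qpt() in the
hunk above, a minimal standalone sketch follows.  The types and helpers are
simplified stand-ins (not Xen's dma_pte or iommu_alloc_pgtable()); it only
illustrates how every non-present slot at every level is pointed at a shared
per-level page, so stray DMA from a quarantined device lands in a single sink
page instead of faulting.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PTES_PER_TABLE 8                /* tiny tables, for illustration */
#define LEVELS         3                /* levels 2..0; level 0 is the leaf */

typedef struct { uint64_t addr; int present; } pte_t;
typedef struct table { pte_t pte[PTES_PER_TABLE]; } table_t;

static table_t *alloc_table(void)
{
    return calloc(1, sizeof(table_t));  /* zeroed, like iommu_alloc_pgtable() */
}

/* Point every non-present entry at the shared page for its level. */
static int fill_qpt(table_t *this, unsigned int level, table_t *shared[LEVELS])
{
    for ( unsigned int i = 0; i < PTES_PER_TABLE; ++i )
    {
        pte_t *pte = &this->pte[i];

        if ( !pte->present )
        {
            if ( !shared[level] )
            {
                shared[level] = alloc_table();
                if ( !shared[level] )
                    return -1;
                if ( level && fill_qpt(shared[level], level - 1, shared) )
                    return -1;
            }
            pte->addr = (uint64_t)(uintptr_t)shared[level];
            pte->present = 1;
        }
        else if ( level &&
                  fill_qpt((table_t *)(uintptr_t)pte->addr, level - 1, shared) )
            return -1;
    }

    return 0;
}

int main(void)
{
    table_t *root = alloc_table();
    table_t *shared[LEVELS] = { NULL };

    if ( !root || fill_qpt(root, LEVELS - 1, shared) )
        return 1;

    /* shared[0] is the single leaf sink page all unused slots funnel into. */
    printf("leaf sink page at %p\n", (void *)shared[0]);

    return 0;
}
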
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: fix (de)assign ordering when RMRRs are in use

In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully updated.

Also adjust a related log message.

This is CVE-2022-26358 / part of XSA-400.

Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
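
As a hedged illustration of the ordering established here, the sketch below
uses standalone stub functions (not Xen's iommu_identity_mapping() or
domain_context_mapping()): on assign, the RMRR identity maps are put in place
before the context entry points the device at the new domain; on deassign,
they are torn down only after the context entry no longer does.

#include <stdio.h>

/* Illustrative stubs standing in for the real (un)map operations. */
static int map_rmrr_regions(const char *dom)
{ printf("map RMRRs for %s\n", dom); return 0; }
static void unmap_rmrr_regions(const char *dom)
{ printf("unmap RMRRs for %s\n", dom); }
static int write_context_entry(const char *dom)
{ printf("context entry -> %s\n", dom); return 0; }
static int clear_context_entry(const char *dom)
{ printf("context entry cleared (was %s)\n", dom); return 0; }

/* Assign: the RMRR maps must already exist when the device can reach them. */
static int assign(const char *dom)
{
    int rc = map_rmrr_regions(dom);

    if ( rc )
        return rc;

    rc = write_context_entry(dom);
    if ( rc )
        unmap_rmrr_regions(dom);   /* roll back: device never reached them */

    return rc;
}

/* Deassign: tear maps down only once the device can no longer reach them. */
static int deassign(const char *dom)
{
    int rc = clear_context_entry(dom);

    if ( !rc )
        unmap_rmrr_regions(dom);

    return rc;
}

int main(void)
{
    return assign("domU") || deassign("domU");
}
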

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2419,6 +2419,10 @@ static int reassign_device_ownership(
 {
     int ret;
 
+    ret = domain_context_unmap(source, devfn, pdev);
+    if ( ret )
+        return ret;
+
     /*
      * Devices assigned to untrusted domains (here assumed to be any domU)
      * can attempt to send arbitrary LAPIC/MSI messages. We are unprotected
@@ -2455,10 +2459,6 @@ static int reassign_device_ownership(
             }
     }
 
-    ret = domain_context_unmap(source, devfn, pdev);
-    if ( ret )
-        return ret;
-
     if ( devfn == pdev->devfn && pdev->domain != dom_io )
     {
         list_move(&pdev->domain_list, &dom_io->pdev_list);
@@ -2534,9 +2534,8 @@ static int intel_iommu_assign_device(
         }
     }
 
-    ret = reassign_device_ownership(s, d, devfn, pdev);
-    if ( ret || d == dom_io )
-        return ret;
+    if ( d == dom_io )
+        return reassign_device_ownership(s, d, devfn, pdev);
 
     /* Setup rmrr identity mapping */
     for_each_rmrr_device( rmrr, bdf, i )
@@ -2549,20 +2548,37 @@ static int intel_iommu_assign_device(
                                          rmrr->end_address, flag);
             if ( ret )
             {
-                int rc;
-
-                rc = reassign_device_ownership(d, s, devfn, pdev);
                 printk(XENLOG_G_ERR VTDPREFIX
-                       " cannot map reserved region (%"PRIx64",%"PRIx64"] for Dom%d (%d)\n",
-                       rmrr->base_address, rmrr->end_address,
-                       d->domain_id, ret);
-                if ( rc )
-                {
-                    printk(XENLOG_ERR VTDPREFIX
-                           " failed to reclaim %pp from %pd (%d)\n",
-                           &PCI_SBDF3(seg, bus, devfn), d, rc);
-                    domain_crash(d);
-                }
+                       "%pd: cannot map reserved region [%"PRIx64",%"PRIx64"]: %d\n",
+                       d, rmrr->base_address, rmrr->end_address, ret);
+                break;
+            }
+        }
+    }
+
+    if ( !ret )
+        ret = reassign_device_ownership(s, d, devfn, pdev);
+
+    /* See reassign_device_ownership() for the hwdom aspect. */
+    if ( !ret || is_hardware_domain(d) )
+        return ret;
+
+    for_each_rmrr_device( rmrr, bdf, i )
+    {
+        if ( rmrr->segment == seg &&
+             PCI_BUS(bdf) == bus &&
+             PCI_DEVFN2(bdf) == devfn )
+        {
+            int rc = iommu_identity_mapping(d, p2m_access_x,
+                                            rmrr->base_address,
+                                            rmrr->end_address, 0);
+
+            if ( rc && rc != -ENOENT )
+            {
+                printk(XENLOG_ERR VTDPREFIX
+                       "%pd: cannot unmap reserved region [%"PRIx64",%"PRIx64"]: %d\n",
+                       d, rmrr->base_address, rmrr->end_address, rc);
+                domain_crash(d);
                 break;
             }
         }
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: fix add/remove ordering when RMRRs are in use

In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully cleared.

Also switch to %pd in related log messages.

Fixes: fa88cfadf918 ("vt-d: Map RMRR in intel_iommu_add_device() if the device has RMRR")
Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1997,14 +1997,6 @@ static int intel_iommu_add_device(u8 dev
     if ( !pdev->domain )
         return -EINVAL;
 
-    ret = domain_context_mapping(pdev->domain, devfn, pdev);
-    if ( ret )
-    {
-        dprintk(XENLOG_ERR VTDPREFIX, "d%d: context mapping failed\n",
-                pdev->domain->domain_id);
-        return ret;
-    }
-
     for_each_rmrr_device ( rmrr, bdf, i )
     {
         if ( rmrr->segment == pdev->seg &&
@@ -2021,12 +2013,17 @@ static int intel_iommu_add_device(u8 dev
                                          rmrr->base_address, rmrr->end_address,
                                          0);
             if ( ret )
-                dprintk(XENLOG_ERR VTDPREFIX, "d%d: RMRR mapping failed\n",
-                        pdev->domain->domain_id);
+                dprintk(XENLOG_ERR VTDPREFIX, "%pd: RMRR mapping failed\n",
+                        pdev->domain);
         }
     }
 
-    return 0;
+    ret = domain_context_mapping(pdev->domain, devfn, pdev);
+    if ( ret )
+        dprintk(XENLOG_ERR VTDPREFIX, "%pd: context mapping failed\n",
+                pdev->domain);
+
+    return ret;
 }
 
 static int intel_iommu_enable_device(struct pci_dev *pdev)
@@ -2048,11 +2045,15 @@ static int intel_iommu_remove_device(u8
 {
     struct acpi_rmrr_unit *rmrr;
     u16 bdf;
-    int i;
+    int ret, i;
 
     if ( !pdev->domain )
         return -EINVAL;
 
+    ret = domain_context_unmap(pdev->domain, devfn, pdev);
+    if ( ret )
+        return ret;
+
     for_each_rmrr_device ( rmrr, bdf, i )
     {
         if ( rmrr->segment != pdev->seg ||
@@ -2068,7 +2069,7 @@ static int intel_iommu_remove_device(u8
                                rmrr->end_address, 0);
     }
 
-    return domain_context_unmap(pdev->domain, devfn, pdev);
+    return 0;
 }
 
 static int __hwdom_init setup_hwdom_device(u8 devfn, struct pci_dev *pdev)
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: tighten iommu_alloc_pgtable()'s parameter

This is to make it more obvious that nothing outside of domain_iommu(d)
actually changes or is otherwise needed by the function.

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -142,7 +142,8 @@ int pi_update_irte(const struct pi_desc
 })
 
 int __must_check iommu_free_pgtables(struct domain *d);
-struct page_info *__must_check iommu_alloc_pgtable(struct domain *d);
+struct domain_iommu;
+struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd);
 
 #endif /* !__ARCH_X86_IOMMU_H__ */
 /*
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -184,7 +184,7 @@ static int iommu_pde_from_dfn(struct dom
     unsigned long  next_table_mfn;
     unsigned int level;
     struct page_info *table;
-    const struct domain_iommu *hd = dom_iommu(d);
+    struct domain_iommu *hd = dom_iommu(d);
 
     table = hd->arch.amd.root_table;
     level = hd->arch.amd.paging_mode;
@@ -219,7 +219,7 @@ static int iommu_pde_from_dfn(struct dom
             mfn = next_table_mfn;
 
             /* allocate lower level page table */
-            table = iommu_alloc_pgtable(d);
+            table = iommu_alloc_pgtable(hd);
             if ( table == NULL )
             {
                 AMD_IOMMU_ERROR("cannot allocate I/O page table\n");
@@ -249,7 +249,7 @@ static int iommu_pde_from_dfn(struct dom
 
             if ( next_table_mfn == 0 )
             {
-                table = iommu_alloc_pgtable(d);
+                table = iommu_alloc_pgtable(hd);
                 if ( table == NULL )
                 {
                     AMD_IOMMU_ERROR("cannot allocate I/O page table\n");
@@ -553,7 +553,7 @@ int __init amd_iommu_quarantine_init(str
 
     spin_lock(&hd->arch.mapping_lock);
 
-    hd->arch.amd.root_table = iommu_alloc_pgtable(d);
+    hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
     if ( !hd->arch.amd.root_table )
         goto out;
 
@@ -568,7 +568,7 @@ int __init amd_iommu_quarantine_init(str
          * page table pages, and the resulting allocations are always
          * zeroed.
          */
-        pg = iommu_alloc_pgtable(d);
+        pg = iommu_alloc_pgtable(hd);
         if ( !pg )
             break;
 
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -242,7 +242,7 @@ int amd_iommu_alloc_root(struct domain *
 
     if ( unlikely(!hd->arch.amd.root_table) )
     {
-        hd->arch.amd.root_table = iommu_alloc_pgtable(d);
+        hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
         if ( !hd->arch.amd.root_table )
             return -ENOMEM;
     }
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -330,7 +330,7 @@ static u64 addr_to_dma_page_maddr(struct
     {
         struct page_info *pg;
 
-        if ( !alloc || !(pg = iommu_alloc_pgtable(domain)) )
+        if ( !alloc || !(pg = iommu_alloc_pgtable(hd)) )
             goto out;
 
         hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
@@ -350,7 +350,7 @@ static u64 addr_to_dma_page_maddr(struct
             if ( !alloc )
                 break;
 
-            pg = iommu_alloc_pgtable(domain);
+            pg = iommu_alloc_pgtable(hd);
             if ( !pg )
                 break;
 
@@ -2766,7 +2766,7 @@ static int __init intel_iommu_quarantine
         goto out;
     }
 
-    pg = iommu_alloc_pgtable(d);
+    pg = iommu_alloc_pgtable(hd);
 
     rc = -ENOMEM;
     if ( !pg )
@@ -2785,7 +2785,7 @@ static int __init intel_iommu_quarantine
          * page table pages, and the resulting allocations are always
          * zeroed.
          */
-        pg = iommu_alloc_pgtable(d);
+        pg = iommu_alloc_pgtable(hd);
 
         if ( !pg )
             goto out;
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -416,9 +416,8 @@ int iommu_free_pgtables(struct domain *d
     return 0;
 }
 
-struct page_info *iommu_alloc_pgtable(struct domain *d)
+struct page_info *iommu_alloc_pgtable(struct domain_iommu *hd)
 {
-    struct domain_iommu *hd = dom_iommu(d);
     unsigned int memflags = 0;
     struct page_info *pg;
     void *p;
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: drop ownership checking from domain_context_mapping_one()

Despite putting in quite a bit of effort, it was not possible to
establish why exactly this code exists (beyond possibly sanity
checking). Rather than have a subsequent change further complicate this
logic, simply get rid of it.

Take the opportunity and move the respective unmap_vtd_domain_page() out
of the locked region.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -124,28 +124,6 @@ static int context_set_domain_id(struct
     return 0;
 }
 
-static int context_get_domain_id(struct context_entry *context,
-                                 struct vtd_iommu *iommu)
-{
-    unsigned long dom_index, nr_dom;
-    int domid = -1;
-
-    if (iommu && context)
-    {
-        nr_dom = cap_ndoms(iommu->cap);
-
-        dom_index = context_domain_id(*context);
-
-        if ( dom_index < nr_dom && iommu->domid_map )
-            domid = iommu->domid_map[dom_index];
-        else
-            dprintk(XENLOG_DEBUG VTDPREFIX,
-                    "dom_index %lu exceeds nr_dom %lu or iommu has no domid_map\n",
-                    dom_index, nr_dom);
-    }
-    return domid;
-}
-
 static void cleanup_domid_map(struct domain *domain, struct vtd_iommu *iommu)
 {
     int iommu_domid = domain_iommu_domid(domain, iommu);
@@ -1416,44 +1394,9 @@ int domain_context_mapping_one(
 
     if ( context_present(*context) )
     {
-        int res = 0;
-
-        /* Try to get domain ownership from device structure.  If that's
-         * not available, try to read it from the context itself. */
-        if ( pdev )
-        {
-            if ( pdev->domain != domain )
-            {
-                printk(XENLOG_G_INFO VTDPREFIX "%pd: %pp owned by %pd",
-                       domain, &PCI_SBDF3(seg, bus, devfn),
-                       pdev->domain);
-                res = -EINVAL;
-            }
-        }
-        else
-        {
-            int cdomain;
-            cdomain = context_get_domain_id(context, iommu);
-            
-            if ( cdomain < 0 )
-            {
-                printk(XENLOG_G_WARNING VTDPREFIX
-                       "%pd: %pp mapped, but can't find owner\n",
-                       domain, &PCI_SBDF3(seg, bus, devfn));
-                res = -EINVAL;
-            }
-            else if ( cdomain != domain->domain_id )
-            {
-                printk(XENLOG_G_INFO VTDPREFIX
-                       "%pd: %pp already mapped to d%d",
-                       domain, &PCI_SBDF3(seg, bus, devfn), cdomain);
-                res = -EINVAL;
-            }
-        }
-
-        unmap_vtd_domain_page(context_entries);
         spin_unlock(&iommu->lock);
-        return res;
+        unmap_vtd_domain_page(context_entries);
+        return 0;
     }
 
     if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: re-assign devices directly

Since it is unspecified how/when the reserved memory regions may get
accessed, devices with RMRRs may not be left disconnected from their
respective mappings (as long as it's not certain that the device has
been fully quiesced). Hence rather than unmapping the old context and
then mapping the new one, re-assignment needs to be done in a single
step.

This is CVE-2022-26359 / part of XSA-400.

Reported-by: Roger Pau Monné <roger.pau@citrix.com>

Similarly, quarantining scratch-page mode relies on page tables being
continuously wired up.

To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any RMRRs. The
main difference is when it comes to updating context entries, which need
to be atomic when there are RMRRs. Yet atomicity can only be achieved
with CMPXCHG16B, availability of which we can't take for granted.

The seemingly complicated choice of non-negative return values for
domain_context_mapping_one() is to limit code churn: This way callers
passing NULL for pdev don't need any adjustment.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
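
To make the atomicity discussion concrete, here is a hedged, standalone
sketch (not the Xen code) of the two update strategies for a 128-bit context
entry: a single 16-byte compare-and-swap when CMPXCHG16B is available,
otherwise ordered 64-bit stores with the present bit cleared first.  It
assumes GCC on x86-64 built with -mcx16, and omits cache syncing and the
subsequent flushes.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct context_entry {
    union {
        struct { uint64_t lo, hi; };
        unsigned __int128 full;
    };
};

#define CTX_PRESENT 1ULL

static bool have_cx16 = true;            /* stand-in for Xen's cpu_has_cx16 */

static int update_context(struct context_entry *ctx, struct context_entry upd)
{
    if ( have_cx16 )
    {
        unsigned __int128 old = ctx->full;

        /* Fails if the entry changed between the read above and the swap. */
        if ( !__sync_bool_compare_and_swap(&ctx->full, old, upd.full) )
            return -1;
    }
    else
    {
        /*
         * Non-atomic fallback: clear the present bit first so the device
         * never observes a half-updated translation, then write hi, then lo.
         */
        __atomic_store_n(&ctx->lo, ctx->lo & ~CTX_PRESENT, __ATOMIC_RELEASE);
        __atomic_store_n(&ctx->hi, upd.hi, __ATOMIC_RELEASE);
        __atomic_store_n(&ctx->lo, upd.lo, __ATOMIC_RELEASE);
    }

    return 0;
}

int main(void)
{
    struct context_entry ctx = { .lo = 0 };
    struct context_entry upd = { .lo = 0x1000 | CTX_PRESENT, .hi = 2 };

    printf("rc=%d lo=%#jx hi=%#jx\n", update_context(&ctx, upd),
           (uintmax_t)ctx.lo, (uintmax_t)ctx.hi);

    return 0;
}
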

--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -84,7 +84,8 @@ void free_pgtable_maddr(u64 maddr);
 void *map_vtd_domain_page(u64 maddr);
 void unmap_vtd_domain_page(const void *va);
 int domain_context_mapping_one(struct domain *domain, struct vtd_iommu *iommu,
-                               u8 bus, u8 devfn, const struct pci_dev *);
+                               uint8_t bus, uint8_t devfn,
+                               const struct pci_dev *pdev, unsigned int mode);
 int domain_context_unmap_one(struct domain *domain, struct vtd_iommu *iommu,
                              u8 bus, u8 devfn);
 int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
@@ -104,8 +105,8 @@ bool is_azalia_tlb_enabled(const struct
 void platform_quirks_init(void);
 void vtd_ops_preamble_quirk(struct vtd_iommu *iommu);
 void vtd_ops_postamble_quirk(struct vtd_iommu *iommu);
-int __must_check me_wifi_quirk(struct domain *domain,
-                               u8 bus, u8 devfn, int map);
+int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus,
+                               uint8_t devfn, unsigned int mode);
 void pci_vtd_quirk(const struct pci_dev *);
 void quirk_iommu_caps(struct vtd_iommu *iommu);
 
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -120,6 +120,7 @@ static int context_set_domain_id(struct
     }
 
     set_bit(i, iommu->domid_bitmap);
+    context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
     context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
     return 0;
 }
@@ -1371,15 +1372,27 @@ static void __hwdom_init intel_iommu_hwd
     }
 }
 
+/*
+ * This function returns
+ * - a negative errno value upon error,
+ * - zero upon success when previously the entry was non-present, or this isn't
+ *   the "main" request for a device (pdev == NULL), or for no-op quarantining
+ *   assignments,
+ * - positive (one) upon success when previously the entry was present and this
+ *   is the "main" request for a device (pdev != NULL).
+ */
 int domain_context_mapping_one(
     struct domain *domain,
     struct vtd_iommu *iommu,
-    u8 bus, u8 devfn, const struct pci_dev *pdev)
+    uint8_t bus, uint8_t devfn, const struct pci_dev *pdev,
+    unsigned int mode)
 {
     struct domain_iommu *hd = dom_iommu(domain);
-    struct context_entry *context, *context_entries;
+    struct context_entry *context, *context_entries, lctxt;
+    __uint128_t old;
     u64 maddr, pgd_maddr;
-    u16 seg = iommu->drhd->segment;
+    uint16_t seg = iommu->drhd->segment, prev_did = 0;
+    struct domain *prev_dom = NULL;
     int rc, ret;
     bool_t flush_dev_iotlb;
 
@@ -1391,17 +1404,32 @@ int domain_context_mapping_one(
     maddr = bus_to_context_maddr(iommu, bus);
     context_entries = (struct context_entry *)map_vtd_domain_page(maddr);
     context = &context_entries[devfn];
+    old = (lctxt = *context).full;
 
-    if ( context_present(*context) )
+    if ( context_present(lctxt) )
     {
-        spin_unlock(&iommu->lock);
-        unmap_vtd_domain_page(context_entries);
-        return 0;
+        domid_t domid;
+
+        prev_did = context_domain_id(lctxt);
+        domid = iommu->domid_map[prev_did];
+        if ( domid < DOMID_FIRST_RESERVED )
+            prev_dom = rcu_lock_domain_by_id(domid);
+        else if ( domid == DOMID_IO )
+            prev_dom = rcu_lock_domain(dom_io);
+        if ( !prev_dom )
+        {
+            spin_unlock(&iommu->lock);
+            unmap_vtd_domain_page(context_entries);
+            dprintk(XENLOG_DEBUG VTDPREFIX,
+                    "no domain for did %u (nr_dom %u)\n",
+                    prev_did, cap_ndoms(iommu->cap));
+            return -ESRCH;
+        }
     }
 
     if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
     {
-        context_set_translation_type(*context, CONTEXT_TT_PASS_THRU);
+        context_set_translation_type(lctxt, CONTEXT_TT_PASS_THRU);
     }
     else
     {
@@ -1413,36 +1441,107 @@ int domain_context_mapping_one(
             spin_unlock(&hd->arch.mapping_lock);
             spin_unlock(&iommu->lock);
             unmap_vtd_domain_page(context_entries);
+            if ( prev_dom )
+                rcu_unlock_domain(prev_dom);
             return -ENOMEM;
         }
 
-        context_set_address_root(*context, pgd_maddr);
+        context_set_address_root(lctxt, pgd_maddr);
         if ( ats_enabled && ecap_dev_iotlb(iommu->ecap) )
-            context_set_translation_type(*context, CONTEXT_TT_DEV_IOTLB);
+            context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB);
         else
-            context_set_translation_type(*context, CONTEXT_TT_MULTI_LEVEL);
+            context_set_translation_type(lctxt, CONTEXT_TT_MULTI_LEVEL);
 
         spin_unlock(&hd->arch.mapping_lock);
     }
 
-    if ( context_set_domain_id(context, domain, iommu) )
+    if ( context_set_domain_id(&lctxt, domain, iommu) )
     {
+    unlock:
         spin_unlock(&iommu->lock);
         unmap_vtd_domain_page(context_entries);
+        if ( prev_dom )
+            rcu_unlock_domain(prev_dom);
         return -EFAULT;
     }
 
-    context_set_address_width(*context, level_to_agaw(iommu->nr_pt_levels));
-    context_set_fault_enable(*context);
-    context_set_present(*context);
+    if ( !prev_dom )
+    {
+        context_set_address_width(lctxt, level_to_agaw(iommu->nr_pt_levels));
+        context_set_fault_enable(lctxt);
+        context_set_present(lctxt);
+    }
+    else if ( prev_dom == domain )
+    {
+        ASSERT(lctxt.full == context->full);
+        rc = !!pdev;
+        goto unlock;
+    }
+    else
+    {
+        ASSERT(context_address_width(lctxt) ==
+               level_to_agaw(iommu->nr_pt_levels));
+        ASSERT(!context_fault_disable(lctxt));
+    }
+
+    if ( cpu_has_cx16 )
+    {
+        __uint128_t res = cmpxchg16b(context, &old, &lctxt.full);
+
+        /*
+         * Hardware does not update the context entry behind our backs,
+         * so the return value should match "old".
+         */
+        if ( res != old )
+        {
+            if ( pdev )
+                check_cleanup_domid_map(domain, pdev, iommu);
+            printk(XENLOG_ERR
+                   "%pp: unexpected context entry %016lx_%016lx (expected %016lx_%016lx)\n",
+                   &PCI_SBDF3(pdev->seg, pdev->bus, devfn),
+                   (uint64_t)(res >> 64), (uint64_t)res,
+                   (uint64_t)(old >> 64), (uint64_t)old);
+            rc = -EILSEQ;
+            goto unlock;
+        }
+    }
+    else if ( !prev_dom || !(mode & MAP_WITH_RMRR) )
+    {
+        context_clear_present(*context);
+        iommu_sync_cache(context, sizeof(*context));
+
+        write_atomic(&context->hi, lctxt.hi);
+        /* No barrier should be needed between these two. */
+        write_atomic(&context->lo, lctxt.lo);
+    }
+    else /* Best effort, updating DID last. */
+    {
+         /*
+          * By non-atomically updating the context entry's DID field last,
+          * during a short window in time TLB entries with the old domain ID
+          * but the new page tables may be inserted.  This could affect I/O
+          * of other devices using this same (old) domain ID.  Such updating
+          * therefore is not a problem if this was the only device associated
+          * with the old domain ID.  Diverting I/O of any of a dying domain's
+          * devices to the quarantine page tables is intended anyway.
+          */
+        if ( !(mode & (MAP_OWNER_DYING | MAP_SINGLE_DEVICE)) )
+            printk(XENLOG_WARNING VTDPREFIX
+                   " %pp: reassignment may cause %pd data corruption\n",
+                   &PCI_SBDF3(seg, bus, devfn), prev_dom);
+
+        write_atomic(&context->lo, lctxt.lo);
+        /* No barrier should be needed between these two. */
+        write_atomic(&context->hi, lctxt.hi);
+    }
+
     iommu_sync_cache(context, sizeof(struct context_entry));
     spin_unlock(&iommu->lock);
 
-    /* Context entry was previously non-present (with domid 0). */
-    rc = iommu_flush_context_device(iommu, 0, PCI_BDF2(bus, devfn),
-                                    DMA_CCMD_MASK_NOBIT, 1);
+    rc = iommu_flush_context_device(iommu, prev_did, PCI_BDF2(bus, devfn),
+                                    DMA_CCMD_MASK_NOBIT, !prev_dom);
     flush_dev_iotlb = !!find_ats_dev_drhd(iommu);
-    ret = iommu_flush_iotlb_dsi(iommu, 0, 1, flush_dev_iotlb);
+    ret = iommu_flush_iotlb_dsi(iommu, prev_did, !prev_dom, flush_dev_iotlb);
 
     /*
      * The current logic for returns:
@@ -1463,17 +1562,26 @@ int domain_context_mapping_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !seg && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, MAP_ME_PHANTOM_FUNC);
+        rc = me_wifi_quirk(domain, bus, devfn, mode);
 
     if ( rc )
     {
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        if ( !prev_dom )
+            ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        else if ( prev_dom != domain ) /* Avoid infinite recursion. */
+            ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
+                                             mode & MAP_WITH_RMRR) < 0;
+        else
+            ret = 1;
 
         if ( !ret && pdev && pdev->devfn == devfn )
             check_cleanup_domid_map(domain, pdev, iommu);
     }
 
-    return rc;
+    if ( prev_dom )
+        rcu_unlock_domain(prev_dom);
+
+    return rc ?: pdev && prev_dom;
 }
 
 static int domain_context_unmap(struct domain *d, uint8_t devfn,
@@ -1483,8 +1591,10 @@ static int domain_context_mapping(struct
                                   struct pci_dev *pdev)
 {
     const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
+    const struct acpi_rmrr_unit *rmrr;
     int ret = 0;
-    uint16_t seg = pdev->seg;
+    unsigned int i, mode = 0;
+    uint16_t seg = pdev->seg, bdf;
     uint8_t bus = pdev->bus, secbus;
 
     /*
@@ -1500,8 +1610,29 @@ static int domain_context_mapping(struct
 
     ASSERT(pcidevs_locked());
 
+    for_each_rmrr_device( rmrr, bdf, i )
+    {
+        if ( rmrr->segment != pdev->seg || bdf != pdev->sbdf.bdf )
+            continue;
+
+        mode |= MAP_WITH_RMRR;
+        break;
+    }
+
+    if ( domain != pdev->domain )
+    {
+        if ( pdev->domain->is_dying )
+            mode |= MAP_OWNER_DYING;
+        else if ( drhd &&
+                  !any_pdev_behind_iommu(pdev->domain, pdev, drhd->iommu) &&
+                  !pdev->phantom_stride )
+            mode |= MAP_SINGLE_DEVICE;
+    }
+
     switch ( pdev->type )
     {
+        bool prev_present;
+
     case DEV_TYPE_PCI_HOST_BRIDGE:
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:Hostbridge: skip %pp map\n",
@@ -1523,7 +1654,9 @@ static int domain_context_mapping(struct
             printk(VTDPREFIX "%pd:PCIe: map %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev);
+                                         pdev, mode);
+        if ( ret > 0 )
+            ret = 0;
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             enable_ats_device(pdev, &drhd->iommu->ats_devices);
 
@@ -1538,9 +1671,10 @@ static int domain_context_mapping(struct
                    domain, &PCI_SBDF3(seg, bus, devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev);
-        if ( ret )
+                                         pdev, mode);
+        if ( ret < 0 )
             break;
+        prev_present = ret;
 
         if ( (ret = find_upstream_bridge(seg, &bus, &devfn, &secbus)) < 1 )
         {
@@ -1548,6 +1682,15 @@ static int domain_context_mapping(struct
                 break;
             ret = -ENXIO;
         }
+        /*
+         * Strictly speaking if the device is the only one behind this bridge
+         * and the only one with this (secbus,0,0) tuple, it could be allowed
+         * to be re-assigned regardless of RMRR presence.  But let's deal with
+         * that case only if it is actually found in the wild.
+         */
+        else if ( prev_present && (mode & MAP_WITH_RMRR) &&
+                  domain != pdev->domain )
+            ret = -EOPNOTSUPP;
 
         /*
          * Mapping a bridge should, if anything, pass the struct pci_dev of
@@ -1556,7 +1699,7 @@ static int domain_context_mapping(struct
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                             NULL);
+                                             NULL, mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1571,10 +1714,15 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL);
+                                             NULL, mode);
 
         if ( ret )
-            domain_context_unmap(domain, devfn, pdev);
+        {
+            if ( !prev_present )
+                domain_context_unmap(domain, devfn, pdev);
+            else if ( pdev->domain != domain ) /* Avoid infinite recursion. */
+                domain_context_mapping(pdev->domain, devfn, pdev);
+        }
 
         break;
 
@@ -2363,17 +2511,46 @@ static int reassign_device_ownership(
 {
     int ret;
 
-    ret = domain_context_unmap(source, devfn, pdev);
+    if ( !QUARANTINE_SKIP(target) )
+    {
+        if ( !has_arch_pdevs(target) )
+            vmx_pi_hooks_assign(target);
+
+        /*
+         * Devices assigned to untrusted domains (here assumed to be any domU)
+         * can attempt to send arbitrary LAPIC/MSI messages. We are unprotected
+         * by the root complex unless interrupt remapping is enabled.
+         */
+        if ( (target != hardware_domain) && !iommu_intremap )
+            untrusted_msi = true;
+
+        ret = domain_context_mapping(target, devfn, pdev);
+
+        if ( !ret && !QUARANTINE_SKIP(source) && pdev->devfn == devfn )
+        {
+            const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
+
+            if ( drhd )
+                check_cleanup_domid_map(source, pdev, drhd->iommu);
+        }
+    }
+    else
+        ret = domain_context_unmap(source, devfn, pdev);
     if ( ret )
+    {
+        if ( !has_arch_pdevs(target) )
+            vmx_pi_hooks_deassign(target);
         return ret;
+    }
 
-    /*
-     * Devices assigned to untrusted domains (here assumed to be any domU)
-     * can attempt to send arbitrary LAPIC/MSI messages. We are unprotected
-     * by the root complex unless interrupt remapping is enabled.
-     */
-    if ( (target != hardware_domain) && !iommu_intremap )
-        untrusted_msi = true;
+    if ( devfn == pdev->devfn && pdev->domain != target )
+    {
+        list_move(&pdev->domain_list, &target->pdev_list);
+        pdev->domain = target;
+    }
+
+    if ( !has_arch_pdevs(source) )
+        vmx_pi_hooks_deassign(source);
 
     /*
      * If the device belongs to the hardware domain, and it has RMRR, don't
@@ -2403,34 +2580,7 @@ static int reassign_device_ownership(
             }
     }
 
-    if ( devfn == pdev->devfn && pdev->domain != dom_io )
-    {
-        list_move(&pdev->domain_list, &dom_io->pdev_list);
-        pdev->domain = dom_io;
-    }
-
-    if ( !has_arch_pdevs(source) )
-        vmx_pi_hooks_deassign(source);
-
-    if ( !has_arch_pdevs(target) )
-        vmx_pi_hooks_assign(target);
-
-    ret = domain_context_mapping(target, devfn, pdev);
-    if ( ret )
-    {
-        if ( !has_arch_pdevs(target) )
-            vmx_pi_hooks_deassign(target);
-
-        return ret;
-    }
-
-    if ( devfn == pdev->devfn && pdev->domain != target )
-    {
-        list_move(&pdev->domain_list, &target->pdev_list);
-        pdev->domain = target;
-    }
-
-    return ret;
+    return 0;
 }
 
 static int intel_iommu_assign_device(
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -202,8 +202,12 @@ struct root_entry {
     do {(root).val |= ((value) & PAGE_MASK_4K);} while(0)
 
 struct context_entry {
-    u64 lo;
-    u64 hi;
+    union {
+        struct {
+            uint64_t lo, hi;
+        };
+        __uint128_t full;
+    };
 };
 #define ROOT_ENTRY_NR (PAGE_SIZE_4K/sizeof(struct root_entry))
 #define context_present(c) ((c).lo & 1)
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -407,7 +407,8 @@ void __init platform_quirks_init(void)
  */
 
 static int __must_check map_me_phantom_function(struct domain *domain,
-                                                u32 dev, int map)
+                                                unsigned int dev,
+                                                unsigned int mode)
 {
     struct acpi_drhd_unit *drhd;
     struct pci_dev *pdev;
@@ -418,9 +419,9 @@ static int __must_check map_me_phantom_f
     drhd = acpi_find_matched_drhd_unit(pdev);
 
     /* map or unmap ME phantom function */
-    if ( map )
+    if ( !(mode & UNMAP_ME_PHANTOM_FUNC) )
         rc = domain_context_mapping_one(domain, drhd->iommu, 0,
-                                        PCI_DEVFN(dev, 7), NULL);
+                                        PCI_DEVFN(dev, 7), NULL, mode);
     else
         rc = domain_context_unmap_one(domain, drhd->iommu, 0,
                                       PCI_DEVFN(dev, 7));
@@ -428,7 +429,8 @@ static int __must_check map_me_phantom_f
     return rc;
 }
 
-int me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map)
+int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn,
+                  unsigned int mode)
 {
     u32 id;
     int rc = 0;
@@ -452,7 +454,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x423b8086:
             case 0x423c8086:
             case 0x423d8086:
-                rc = map_me_phantom_function(domain, 3, map);
+                rc = map_me_phantom_function(domain, 3, mode);
                 break;
             default:
                 break;
@@ -478,7 +480,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x42388086:        /* Puma Peak */
             case 0x422b8086:
             case 0x422c8086:
-                rc = map_me_phantom_function(domain, 22, map);
+                rc = map_me_phantom_function(domain, 22, mode);
                 break;
             default:
                 break;
--- a/xen/drivers/passthrough/vtd/vtd.h
+++ b/xen/drivers/passthrough/vtd/vtd.h
@@ -22,8 +22,14 @@
 
 #include <xen/iommu.h>
 
-#define MAP_ME_PHANTOM_FUNC      1
-#define UNMAP_ME_PHANTOM_FUNC    0
+/*
+ * Values for domain_context_mapping_one()'s and me_wifi_quirk()'s "mode"
+ * parameters.
+ */
+#define MAP_WITH_RMRR         (1u << 0)
+#define MAP_OWNER_DYING       (1u << 1)
+#define MAP_SINGLE_DEVICE     (1u << 2)
+#define UNMAP_ME_PHANTOM_FUNC (1u << 3)
 
 /* Allow for both IOAPIC and IOSAPIC. */
 #define IO_xAPIC_route_entry IO_APIC_route_entry
From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: re-assign devices directly

Since it is unspecified how/when the unity-mapped memory ranges may get
accessed, devices with unity map ranges may not be left disconnected
from their unity mappings (as long as it's not certain that the device
has been fully quiesced). Hence rather than tearing down the old root
page table pointer and then establishing the new one, re-assignment
needs to be done in a single step.

This is CVE-2022-26360 / part of XSA-400.

Reported-by: Roger Pau Monné <roger.pau@citrix.com>

Similarly, quarantining scratch-page mode relies on page tables being
continuously wired up.

To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any unity map
ranges.  The main difference is when it comes to updating DTEs, which need
to be atomic when there are unity mappings. Yet atomicity can only be
achieved with CMPXCHG16B, availability of which we can't take for granted.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
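
On the AMD side the same tension is folded into
amd_iommu_set_root_page_table()'s return convention: negative errno on
failure, 0 for an atomic update, 1 for a successful but non-atomic one.  A
hedged sketch of how a caller might act on that convention follows; all names
are illustrative stand-ins, not Xen's.

#include <stdbool.h>
#include <stdio.h>

/*
 * Stand-in for the DTE update: 0 = atomic success, 1 = successful but
 * non-atomic (best effort) update, negative value = failure.
 */
static int set_root_page_table(bool cx16, bool entry_live)
{
    if ( !entry_live )
        return 0;          /* entry wasn't in use: plain ordered rewrite */
    if ( cx16 )
        return 0;          /* 16-byte cmpxchg path: fully atomic */
    return 1;              /* ordered stores, domain ID written last */
}

static int setup_domain_device(bool cx16, bool entry_live, bool shared_domid)
{
    int rc = set_root_page_table(cx16, entry_live);

    if ( rc < 0 )
        return rc;         /* propagate errors to the caller */

    /* A non-atomic update only matters if other devices share the old ID. */
    if ( rc && shared_domid )
        fprintf(stderr, "reassignment may cause data corruption\n");

    return 0;
}

int main(void)
{
    return setup_domain_device(false, true, true);
}
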

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -262,9 +262,13 @@ void amd_iommu_set_intremap_table(struct
                                   const void *ptr,
                                   const struct amd_iommu *iommu,
                                   bool valid);
-void amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
-				   uint64_t root_ptr, uint16_t domain_id,
-				   uint8_t paging_mode, bool valid);
+#define SET_ROOT_VALID          (1u << 0)
+#define SET_ROOT_WITH_UNITY_MAP (1u << 1)
+int __must_check amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
+                                               uint64_t root_ptr,
+                                               uint16_t domain_id,
+                                               uint8_t paging_mode,
+                                               unsigned int flags);
 void iommu_dte_add_device_entry(struct amd_iommu_dte *dte,
                                 const struct ivrs_mappings *ivrs_dev);
 
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -114,10 +114,69 @@ static unsigned int set_iommu_ptes_prese
     return flush_flags;
 }
 
-void amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
-                                   uint64_t root_ptr, uint16_t domain_id,
-                                   uint8_t paging_mode, bool valid)
+/*
+ * This function returns
+ * - -errno for errors,
+ * - 0 for a successful update, atomic when necessary
+ * - 1 for a successful but non-atomic update, which may need to be warned
+ *   about by the caller.
+ */
+int amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
+                                  uint64_t root_ptr, uint16_t domain_id,
+                                  uint8_t paging_mode, unsigned int flags)
 {
+    bool valid = flags & SET_ROOT_VALID;
+
+    if ( dte->v && dte->tv &&
+         (cpu_has_cx16 || (flags & SET_ROOT_WITH_UNITY_MAP)) )
+    {
+        union {
+            struct amd_iommu_dte dte;
+            uint64_t raw64[4];
+            __uint128_t raw128[2];
+        } ldte = { .dte = *dte };
+        __uint128_t old = ldte.raw128[0];
+        int ret = 0;
+
+        ldte.dte.domain_id = domain_id;
+        ldte.dte.pt_root = paddr_to_pfn(root_ptr);
+        ldte.dte.iw = true;
+        ldte.dte.ir = true;
+        ldte.dte.paging_mode = paging_mode;
+        ldte.dte.v = valid;
+
+        if ( cpu_has_cx16 )
+        {
+            __uint128_t res = cmpxchg16b(dte, &old, &ldte.raw128[0]);
+
+            /*
+             * Hardware does not update the DTE behind our backs, so the
+             * return value should match "old".
+             */
+            if ( res != old )
+            {
+                printk(XENLOG_ERR
+                       "Dom%d: unexpected DTE %016lx_%016lx (expected %016lx_%016lx)\n",
+                       domain_id,
+                       (uint64_t)(res >> 64), (uint64_t)res,
+                       (uint64_t)(old >> 64), (uint64_t)old);
+                ret = -EILSEQ;
+            }
+        }
+        else /* Best effort, updating domain_id last. */
+        {
+            uint64_t *ptr = (void *)dte;
+
+            write_atomic(ptr + 0, ldte.raw64[0]);
+            /* No barrier should be needed between these two. */
+            write_atomic(ptr + 1, ldte.raw64[1]);
+
+            ret = 1;
+        }
+
+        return ret;
+    }
+
     if ( valid || dte->v )
     {
         dte->tv = false;
@@ -132,6 +191,8 @@ void amd_iommu_set_root_page_table(struc
     smp_wmb();
     dte->tv = true;
     dte->v = valid;
+
+    return 0;
 }
 
 void amd_iommu_set_intremap_table(
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -96,13 +96,32 @@ static int __must_check allocate_domain_
     return rc;
 }
 
+static bool any_pdev_behind_iommu(const struct domain *d,
+                                  const struct pci_dev *exclude,
+                                  const struct amd_iommu *iommu)
+{
+    const struct pci_dev *pdev;
+
+    for_each_pdev ( d, pdev )
+    {
+        if ( pdev == exclude )
+            continue;
+
+        if ( find_iommu_for_device(pdev->seg, pdev->sbdf.bdf) == iommu )
+            return true;
+    }
+
+    return false;
+}
+
 static int __must_check amd_iommu_setup_domain_device(
     struct domain *domain, struct amd_iommu *iommu,
     uint8_t devfn, struct pci_dev *pdev)
 {
     struct amd_iommu_dte *table, *dte;
     unsigned long flags;
-    int req_id, valid = 1, rc;
+    unsigned int req_id, sr_flags;
+    int rc;
     u8 bus = pdev->bus;
     struct domain_iommu *hd = dom_iommu(domain);
     const struct ivrs_mappings *ivrs_dev;
@@ -116,8 +135,11 @@ static int __must_check amd_iommu_setup_
     if ( rc )
         return rc;
 
-    if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
-        valid = 0;
+    req_id = get_dma_requestor_id(iommu->seg, pdev->sbdf.bdf);
+    ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
+    sr_flags = (iommu_hwdom_passthrough && is_hardware_domain(domain)
+                ? 0 : SET_ROOT_VALID)
+               | (ivrs_dev->unity_map ? SET_ROOT_WITH_UNITY_MAP : 0);
 
     /* get device-table entry */
     req_id = get_dma_requestor_id(iommu->seg, PCI_BDF2(bus, devfn));
@@ -130,9 +152,15 @@ static int __must_check amd_iommu_setup_
     if ( !dte->v || !dte->tv )
     {
         /* bind DTE to domain page-tables */
-        amd_iommu_set_root_page_table(
-            dte, page_to_maddr(hd->arch.amd.root_table),
-            domain->domain_id, hd->arch.amd.paging_mode, valid);
+        rc = amd_iommu_set_root_page_table(
+                 dte, page_to_maddr(hd->arch.amd.root_table),
+                 domain->domain_id, hd->arch.amd.paging_mode, sr_flags);
+        if ( rc )
+        {
+            ASSERT(rc < 0);
+            spin_unlock_irqrestore(&iommu->lock, flags);
+            return rc;
+        }
 
         /* Undo what amd_iommu_disable_domain_device() may have done. */
         if ( dte->it_root )
@@ -152,17 +180,76 @@ static int __must_check amd_iommu_setup_
         spin_unlock_irqrestore(&iommu->lock, flags);
 
         amd_iommu_flush_device(iommu, req_id);
+    }
+    else if ( dte->pt_root != mfn_x(page_to_mfn(hd->arch.amd.root_table)) )
+    {
+        /*
+         * Strictly speaking if the device is the only one with this requestor
+         * ID, it could be allowed to be re-assigned regardless of unity map
+         * presence.  But let's deal with that case only if it is actually
+         * found in the wild.
+         */
+        if ( req_id != PCI_BDF2(bus, devfn) &&
+             (sr_flags & SET_ROOT_WITH_UNITY_MAP) )
+            rc = -EOPNOTSUPP;
+        else
+            rc = amd_iommu_set_root_page_table(
+                     dte, page_to_maddr(hd->arch.amd.root_table),
+                     domain->domain_id, hd->arch.amd.paging_mode, sr_flags);
+        if ( rc < 0 )
+        {
+            spin_unlock_irqrestore(&iommu->lock, flags);
+            return rc;
+        }
+        if ( rc &&
+             domain != pdev->domain &&
+             /*
+              * By non-atomically updating the DTE's domain ID field last,
+              * during a short window in time TLB entries with the old domain
+              * ID but the new page tables may have been inserted.  This could
+              * affect I/O of other devices using this same (old) domain ID.
+              * Such updating therefore is not a problem if this was the only
+              * device associated with the old domain ID.  Diverting I/O of any
+              * of a dying domain's devices to the quarantine page tables is
+              * intended anyway.
+              */
+             !pdev->domain->is_dying &&
+             (any_pdev_behind_iommu(pdev->domain, pdev, iommu) ||
+              pdev->phantom_stride) )
+            AMD_IOMMU_WARN(" %pp: reassignment may cause %pd data corruption\n",
+                           &PCI_SBDF3(pdev->seg, bus, devfn), pdev->domain);
+
+        /*
+         * Check remaining settings are still in place from an earlier call
+         * here. They're all independent of the domain, so should not have
+         * changed.
+         */
+        if ( dte->it_root )
+            ASSERT(dte->int_ctl == IOMMU_DEV_TABLE_INT_CONTROL_TRANSLATED);
+        ASSERT(dte->iv == iommu_intremap);
+        ASSERT(dte->ex == ivrs_dev->dte_allow_exclusion);
+        ASSERT(dte->sys_mgt == MASK_EXTR(ivrs_dev->device_flags,
+                                         ACPI_IVHD_SYSTEM_MGMT));
 
-        AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
-                        "root table = %#"PRIx64", "
-                        "domain = %d, paging mode = %d\n",
-                        req_id, pdev->type,
-                        page_to_maddr(hd->arch.amd.root_table),
-                        domain->domain_id, hd->arch.amd.paging_mode);
+        if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
+             !ivrs_dev->block_ats &&
+             iommu_has_cap(iommu, PCI_CAP_IOTLB_SHIFT) )
+            ASSERT(dte->i == ats_enabled);
+
+        spin_unlock_irqrestore(&iommu->lock, flags);
+
+        amd_iommu_flush_device(iommu, req_id);
     }
     else
         spin_unlock_irqrestore(&iommu->lock, flags);
 
+    AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
+                    "root table = %#"PRIx64", "
+                    "domain = %d, paging mode = %d\n",
+                    req_id, pdev->type,
+                    page_to_maddr(hd->arch.amd.root_table),
+                    domain->domain_id, hd->arch.amd.paging_mode);
+
     ASSERT(pcidevs_locked());
 
     if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
@@ -366,7 +453,20 @@ static int reassign_device(struct domain
         return -ENODEV;
     }
 
-    amd_iommu_disable_domain_device(source, iommu, devfn, pdev);
+    if ( !QUARANTINE_SKIP(target) )
+    {
+        rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
+        if ( rc )
+            return rc;
+    }
+    else
+        amd_iommu_disable_domain_device(source, iommu, devfn, pdev);
+
+    if ( devfn == pdev->devfn && pdev->domain != target )
+    {
+        list_move(&pdev->domain_list, &target->pdev_list);
+        pdev->domain = target;
+    }
 
     /*
      * If the device belongs to the hardware domain, and it has a unity mapping,
@@ -382,25 +482,9 @@ static int reassign_device(struct domain
             return rc;
     }
 
-    if ( devfn == pdev->devfn && pdev->domain != dom_io )
-    {
-        list_move(&pdev->domain_list, &dom_io->pdev_list);
-        pdev->domain = dom_io;
-    }
-
-    rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
-    if ( rc )
-        return rc;
-
     AMD_IOMMU_DEBUG("Re-assign %pp from dom%d to dom%d\n",
                     &pdev->sbdf, source->domain_id, target->domain_id);
 
-    if ( devfn == pdev->devfn && pdev->domain != target )
-    {
-        list_move(&pdev->domain_list, &target->pdev_list);
-        pdev->domain = target;
-    }
-
     return 0;
 }
 
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: prepare for per-device quarantine page tables (part I)

Arrange for domain ID and page table root to be passed around, the latter in
particular to domain_pgd_maddr() such that taking it from the per-domain
fields can be overridden.

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -85,9 +85,10 @@ void *map_vtd_domain_page(u64 maddr);
 void unmap_vtd_domain_page(const void *va);
 int domain_context_mapping_one(struct domain *domain, struct vtd_iommu *iommu,
                                uint8_t bus, uint8_t devfn,
-                               const struct pci_dev *pdev, unsigned int mode);
+                               const struct pci_dev *pdev, domid_t domid,
+                               paddr_t pgd_maddr, unsigned int mode);
 int domain_context_unmap_one(struct domain *domain, struct vtd_iommu *iommu,
-                             u8 bus, u8 devfn);
+                             uint8_t bus, uint8_t devfn, domid_t domid);
 int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt);
 
 unsigned int io_apic_read_remap_rte(unsigned int apic, unsigned int reg);
@@ -106,7 +107,8 @@ void platform_quirks_init(void);
 void vtd_ops_preamble_quirk(struct vtd_iommu *iommu);
 void vtd_ops_postamble_quirk(struct vtd_iommu *iommu);
 int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus,
-                               uint8_t devfn, unsigned int mode);
+                               uint8_t devfn, domid_t domid, paddr_t pgd_maddr,
+                               unsigned int mode);
 void pci_vtd_quirk(const struct pci_dev *);
 void quirk_iommu_caps(struct vtd_iommu *iommu);
 
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -43,7 +43,7 @@
 #include "../ats.h"
 
 /* dom_io is used as a sentinel for quarantined devices */
-#define QUARANTINE_SKIP(d) ((d) == dom_io && !dom_iommu(d)->arch.vtd.pgd_maddr)
+#define QUARANTINE_SKIP(d, pgd_maddr) ((d) == dom_io && !(pgd_maddr))
 
 /* Possible unfiltered LAPIC/MSI messages from untrusted sources? */
 bool __read_mostly untrusted_msi;
@@ -358,15 +358,17 @@ static u64 addr_to_dma_page_maddr(struct
     return pte_maddr;
 }
 
-static uint64_t domain_pgd_maddr(struct domain *d, unsigned int nr_pt_levels)
+static paddr_t domain_pgd_maddr(struct domain *d, paddr_t pgd_maddr,
+                                unsigned int nr_pt_levels)
 {
     struct domain_iommu *hd = dom_iommu(d);
-    uint64_t pgd_maddr;
     unsigned int agaw;
 
     ASSERT(spin_is_locked(&hd->arch.mapping_lock));
 
-    if ( iommu_use_hap_pt(d) )
+    if ( pgd_maddr )
+        /* nothing */;
+    else if ( iommu_use_hap_pt(d) )
     {
         pagetable_t pgt = p2m_get_pagetable(p2m_get_hostp2m(d));
 
@@ -1385,18 +1387,18 @@ int domain_context_mapping_one(
     struct domain *domain,
     struct vtd_iommu *iommu,
     uint8_t bus, uint8_t devfn, const struct pci_dev *pdev,
-    unsigned int mode)
+    domid_t domid, paddr_t pgd_maddr, unsigned int mode)
 {
     struct domain_iommu *hd = dom_iommu(domain);
     struct context_entry *context, *context_entries, lctxt;
     __uint128_t old;
-    u64 maddr, pgd_maddr;
+    uint64_t maddr;
     uint16_t seg = iommu->drhd->segment, prev_did = 0;
     struct domain *prev_dom = NULL;
     int rc, ret;
     bool_t flush_dev_iotlb;
 
-    if ( QUARANTINE_SKIP(domain) )
+    if ( QUARANTINE_SKIP(domain, pgd_maddr) )
         return 0;
 
     ASSERT(pcidevs_locked());
@@ -1433,10 +1435,12 @@ int domain_context_mapping_one(
     }
     else
     {
+        paddr_t root;
+
         spin_lock(&hd->arch.mapping_lock);
 
-        pgd_maddr = domain_pgd_maddr(domain, iommu->nr_pt_levels);
-        if ( !pgd_maddr )
+        root = domain_pgd_maddr(domain, pgd_maddr, iommu->nr_pt_levels);
+        if ( !root )
         {
             spin_unlock(&hd->arch.mapping_lock);
             spin_unlock(&iommu->lock);
@@ -1446,7 +1450,7 @@ int domain_context_mapping_one(
             return -ENOMEM;
         }
 
-        context_set_address_root(lctxt, pgd_maddr);
+        context_set_address_root(lctxt, root);
         if ( ats_enabled && ecap_dev_iotlb(iommu->ecap) )
             context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB);
         else
@@ -1562,15 +1566,21 @@ int domain_context_mapping_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !seg && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, mode);
+        rc = me_wifi_quirk(domain, bus, devfn, domid, pgd_maddr, mode);
 
     if ( rc )
     {
         if ( !prev_dom )
-            ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+            ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+                                           domain->domain_id);
         else if ( prev_dom != domain ) /* Avoid infinite recursion. */
+        {
+            hd = dom_iommu(prev_dom);
             ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
+                                             domain->domain_id,
+                                             hd->arch.vtd.pgd_maddr,
                                              mode & MAP_WITH_RMRR) < 0;
+        }
         else
             ret = 1;
 
@@ -1592,6 +1602,7 @@ static int domain_context_mapping(struct
 {
     const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
     const struct acpi_rmrr_unit *rmrr;
+    paddr_t pgd_maddr = dom_iommu(domain)->arch.vtd.pgd_maddr;
     int ret = 0;
     unsigned int i, mode = 0;
     uint16_t seg = pdev->seg, bdf;
@@ -1654,7 +1665,8 @@ static int domain_context_mapping(struct
             printk(VTDPREFIX "%pd:PCIe: map %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, mode);
+                                         pdev, domain->domain_id, pgd_maddr,
+                                         mode);
         if ( ret > 0 )
             ret = 0;
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
@@ -1671,7 +1683,8 @@ static int domain_context_mapping(struct
                    domain, &PCI_SBDF3(seg, bus, devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, mode);
+                                         pdev, domain->domain_id, pgd_maddr,
+                                         mode);
         if ( ret < 0 )
             break;
         prev_present = ret;
@@ -1699,7 +1712,8 @@ static int domain_context_mapping(struct
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                             NULL, mode);
+                                             NULL, domain->domain_id, pgd_maddr,
+                                             mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1714,7 +1728,8 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL, mode);
+                                             NULL, domain->domain_id, pgd_maddr,
+                                             mode);
 
         if ( ret )
         {
@@ -1742,14 +1757,14 @@ static int domain_context_mapping(struct
 int domain_context_unmap_one(
     struct domain *domain,
     struct vtd_iommu *iommu,
-    u8 bus, u8 devfn)
+    uint8_t bus, uint8_t devfn, domid_t domid)
 {
     struct context_entry *context, *context_entries;
     u64 maddr;
     int iommu_domid, rc, ret;
     bool_t flush_dev_iotlb;
 
-    if ( QUARANTINE_SKIP(domain) )
+    if ( QUARANTINE_SKIP(domain, dom_iommu(domain)->arch.vtd.pgd_maddr) )
         return 0;
 
     ASSERT(pcidevs_locked());
@@ -1803,7 +1818,7 @@ int domain_context_unmap_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !iommu->drhd->segment && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, UNMAP_ME_PHANTOM_FUNC);
+        rc = me_wifi_quirk(domain, bus, devfn, domid, 0, UNMAP_ME_PHANTOM_FUNC);
 
     if ( rc && !is_hardware_domain(domain) && domain != dom_io )
     {
@@ -1850,7 +1865,8 @@ static int domain_context_unmap(struct d
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCIe: unmap %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+                                       domain->domain_id);
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             disable_ats_device(pdev);
 
@@ -1863,7 +1879,8 @@ static int domain_context_unmap(struct d
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCI: unmap %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+                                       domain->domain_id);
         if ( ret )
             break;
 
@@ -1889,12 +1906,15 @@ static int domain_context_unmap(struct d
         /* PCIe to PCI/PCIx bridge */
         if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
         {
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);
+            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                           domain->domain_id);
             if ( !ret )
-                ret = domain_context_unmap_one(domain, iommu, secbus, 0);
+                ret = domain_context_unmap_one(domain, iommu, secbus, 0,
+                                               domain->domain_id);
         }
         else /* Legacy PCI bridge */
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);
+            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                           domain->domain_id);
 
         break;
 
@@ -1904,7 +1924,8 @@ static int domain_context_unmap(struct d
         return -EINVAL;
     }
 
-    if ( !ret && !QUARANTINE_SKIP(domain) && pdev->devfn == devfn )
+    if ( !ret && pdev->devfn == devfn &&
+         !QUARANTINE_SKIP(domain, dom_iommu(domain)->arch.vtd.pgd_maddr) )
         check_cleanup_domid_map(domain, pdev, iommu);
 
     return ret;
@@ -2511,7 +2532,7 @@ static int reassign_device_ownership(
 {
     int ret;
 
-    if ( !QUARANTINE_SKIP(target) )
+    if ( !QUARANTINE_SKIP(target, dom_iommu(target)->arch.vtd.pgd_maddr) )
     {
         if ( !has_arch_pdevs(target) )
             vmx_pi_hooks_assign(target);
@@ -2526,7 +2547,8 @@ static int reassign_device_ownership(
 
         ret = domain_context_mapping(target, devfn, pdev);
 
-        if ( !ret && !QUARANTINE_SKIP(source) && pdev->devfn == devfn )
+        if ( !ret && pdev->devfn == devfn &&
+             !QUARANTINE_SKIP(source, dom_iommu(source)->arch.vtd.pgd_maddr) )
         {
             const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
 
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -408,6 +408,8 @@ void __init platform_quirks_init(void)
 
 static int __must_check map_me_phantom_function(struct domain *domain,
                                                 unsigned int dev,
+                                                domid_t domid,
+                                                paddr_t pgd_maddr,
                                                 unsigned int mode)
 {
     struct acpi_drhd_unit *drhd;
@@ -421,16 +423,17 @@ static int __must_check map_me_phantom_f
     /* map or unmap ME phantom function */
     if ( !(mode & UNMAP_ME_PHANTOM_FUNC) )
         rc = domain_context_mapping_one(domain, drhd->iommu, 0,
-                                        PCI_DEVFN(dev, 7), NULL, mode);
+                                        PCI_DEVFN(dev, 7), NULL,
+                                        domid, pgd_maddr, mode);
     else
         rc = domain_context_unmap_one(domain, drhd->iommu, 0,
-                                      PCI_DEVFN(dev, 7));
+                                      PCI_DEVFN(dev, 7), domid);
 
     return rc;
 }
 
 int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn,
-                  unsigned int mode)
+                  domid_t domid, paddr_t pgd_maddr, unsigned int mode)
 {
     u32 id;
     int rc = 0;
@@ -454,7 +457,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x423b8086:
             case 0x423c8086:
             case 0x423d8086:
-                rc = map_me_phantom_function(domain, 3, mode);
+                rc = map_me_phantom_function(domain, 3, domid, pgd_maddr, mode);
                 break;
             default:
                 break;
@@ -480,7 +483,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x42388086:        /* Puma Peak */
             case 0x422b8086:
             case 0x422c8086:
-                rc = map_me_phantom_function(domain, 22, mode);
+                rc = map_me_phantom_function(domain, 22, domid, pgd_maddr, mode);
                 break;
             default:
                 break;
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: prepare for per-device quarantine page tables (part II)

Replace the passing of struct domain * by domid_t, in preparation for
per-device quarantine page tables also requiring per-device pseudo
domain IDs, which aren't going to be associated with any struct domain
instances.

No functional change intended (except for slightly adjusted log message
text).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -62,8 +62,8 @@ static struct tasklet vtd_fault_tasklet;
 static int setup_hwdom_device(u8 devfn, struct pci_dev *);
 static void setup_hwdom_rmrr(struct domain *d);
 
-static int domain_iommu_domid(struct domain *d,
-                              struct vtd_iommu *iommu)
+static int get_iommu_did(domid_t domid, const struct vtd_iommu *iommu,
+                         bool warn)
 {
     unsigned long nr_dom, i;
 
@@ -71,16 +71,16 @@ static int domain_iommu_domid(struct dom
     i = find_first_bit(iommu->domid_bitmap, nr_dom);
     while ( i < nr_dom )
     {
-        if ( iommu->domid_map[i] == d->domain_id )
+        if ( iommu->domid_map[i] == domid )
             return i;
 
         i = find_next_bit(iommu->domid_bitmap, nr_dom, i+1);
     }
 
-    if ( !d->is_dying )
+    if ( warn )
         dprintk(XENLOG_ERR VTDPREFIX,
-                "Cannot get valid iommu %u domid: %pd\n",
-                iommu->index, d);
+                "No valid iommu %u domid for Dom%d\n",
+                iommu->index, domid);
 
     return -1;
 }
@@ -88,8 +88,7 @@ static int domain_iommu_domid(struct dom
 #define DID_FIELD_WIDTH 16
 #define DID_HIGH_OFFSET 8
 static int context_set_domain_id(struct context_entry *context,
-                                 struct domain *d,
-                                 struct vtd_iommu *iommu)
+                                 domid_t domid, struct vtd_iommu *iommu)
 {
     unsigned long nr_dom, i;
     int found = 0;
@@ -100,7 +99,7 @@ static int context_set_domain_id(struct
     i = find_first_bit(iommu->domid_bitmap, nr_dom);
     while ( i < nr_dom )
     {
-        if ( iommu->domid_map[i] == d->domain_id )
+        if ( iommu->domid_map[i] == domid )
         {
             found = 1;
             break;
@@ -116,7 +115,7 @@ static int context_set_domain_id(struct
             dprintk(XENLOG_ERR VTDPREFIX, "IOMMU: no free domain ids\n");
             return -EFAULT;
         }
-        iommu->domid_map[i] = d->domain_id;
+        iommu->domid_map[i] = domid;
     }
 
     set_bit(i, iommu->domid_bitmap);
@@ -125,9 +124,9 @@ static int context_set_domain_id(struct
     return 0;
 }
 
-static void cleanup_domid_map(struct domain *domain, struct vtd_iommu *iommu)
+static void cleanup_domid_map(domid_t domid, struct vtd_iommu *iommu)
 {
-    int iommu_domid = domain_iommu_domid(domain, iommu);
+    int iommu_domid = get_iommu_did(domid, iommu, false);
 
     if ( iommu_domid >= 0 )
     {
@@ -167,7 +166,7 @@ static bool any_pdev_behind_iommu(const
  * If no other devices under the same iommu owned by this domain,
  * clear iommu in iommu_bitmap and clear domain_id in domid_bitmap.
  */
-static void check_cleanup_domid_map(struct domain *d,
+static void check_cleanup_domid_map(const struct domain *d,
                                     const struct pci_dev *exclude,
                                     struct vtd_iommu *iommu)
 {
@@ -183,7 +182,7 @@ static void check_cleanup_domid_map(stru
     if ( !found )
     {
         clear_bit(iommu->index, dom_iommu(d)->arch.vtd.iommu_bitmap);
-        cleanup_domid_map(d, iommu);
+        cleanup_domid_map(d->domain_id, iommu);
     }
 }
 
@@ -683,7 +682,7 @@ static int __must_check iommu_flush_iotl
             continue;
 
         flush_dev_iotlb = !!find_ats_dev_drhd(iommu);
-        iommu_domid= domain_iommu_domid(d, iommu);
+        iommu_domid = get_iommu_did(d->domain_id, iommu, !d->is_dying);
         if ( iommu_domid == -1 )
             continue;
 
@@ -1459,7 +1458,7 @@ int domain_context_mapping_one(
         spin_unlock(&hd->arch.mapping_lock);
     }
 
-    if ( context_set_domain_id(&lctxt, domain, iommu) )
+    if ( context_set_domain_id(&lctxt, domid, iommu) )
     {
     unlock:
         spin_unlock(&iommu->lock);
@@ -1785,7 +1784,7 @@ int domain_context_unmap_one(
     context_clear_entry(*context);
     iommu_sync_cache(context, sizeof(struct context_entry));
 
-    iommu_domid= domain_iommu_domid(domain, iommu);
+    iommu_domid = get_iommu_did(domid, iommu, !domain->is_dying);
     if ( iommu_domid == -1 )
     {
         spin_unlock(&iommu->lock);
@@ -1953,7 +1952,7 @@ static void iommu_domain_teardown(struct
     ASSERT(!hd->arch.vtd.pgd_maddr);
 
     for_each_drhd_unit ( drhd )
-        cleanup_domid_map(d, drhd->iommu);
+        cleanup_domid_map(d->domain_id, drhd->iommu);
 
     XFREE(hd->arch.vtd.iommu_bitmap);
 }
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: maintain a per-device pseudo domain ID

In order to subsequently enable per-device quarantine page tables, we'll
need domain-ID-like identifiers to be inserted in the respective device
(AMD) or context (Intel) table entries alongside the per-device page
table root addresses.

Make use of the fact that "real" domain IDs occupy only half of the
value range coverable by domid_t; the other half can then hold the
per-device pseudo IDs.

Note that in VT-d's iommu_alloc() I didn't want to introduce new memory
leaks in case of error, but existing ones don't get plugged - that'll be
the subject of a later change.

The VT-d changes are slightly asymmetric, but this way we can avoid
assigning pseudo domain IDs to devices which would never be mapped,
while still avoiding the addition of a new parameter to
domain_context_unmap().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
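
To picture the ID-space split (this sketch is an editorial illustration
only, not part of the patches and not Xen code): "real" IDs stay at or
below a mask value, while per-device pseudo IDs are carved out of the
upper half of the 16-bit range and tracked in a bitmap. All names and
constants below (ID_MASK, PSEUDO_BASE, alloc_pseudo_id(), ...) are
invented for the sketch and merely mirror the idea, not Xen's exact
definitions.

    #include <stdint.h>
    #include <stdio.h>

    #define ID_MASK      0x7fffu            /* "real" IDs: 0 .. ID_MASK */
    #define PSEUDO_BASE  (ID_MASK + 1)      /* pseudo IDs: the upper half */
    #define PSEUDO_COUNT (UINT16_MAX - ID_MASK)
    #define ID_INVALID   0x7ff4u            /* sentinel below PSEUDO_BASE */

    static unsigned char map[(PSEUDO_COUNT + 7) / 8];

    static uint16_t alloc_pseudo_id(void)
    {
        for (unsigned int idx = 0; idx < PSEUDO_COUNT; ++idx)
            if (!(map[idx / 8] & (1u << (idx % 8)))) {
                map[idx / 8] |= 1u << (idx % 8);
                return (uint16_t)(idx | PSEUDO_BASE);
            }
        return ID_INVALID;                  /* upper half exhausted */
    }

    static void free_pseudo_id(uint16_t id)
    {
        if (id == ID_INVALID)
            return;
        map[(id & ID_MASK) / 8] &= ~(1u << ((id & ID_MASK) % 8));
    }

    int main(void)
    {
        uint16_t a = alloc_pseudo_id(), b = alloc_pseudo_id();

        printf("pseudo IDs: %#x %#x\n", (unsigned)a, (unsigned)b);
        free_pseudo_id(a);
        free_pseudo_id(b);
        return 0;
    }

The allocator in the actual patch additionally remembers where the last
search stopped, so that a freed ID is unlikely to be handed out again
right away.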

--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -141,6 +141,10 @@ int pi_update_irte(const struct pi_desc
         iommu_vcall(ops, sync_cache, addr, size);       \
 })
 
+unsigned long *iommu_init_domid(void);
+domid_t iommu_alloc_domid(unsigned long *map);
+void iommu_free_domid(domid_t domid, unsigned long *map);
+
 int __must_check iommu_free_pgtables(struct domain *d);
 struct domain_iommu;
 struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd);
--- a/xen/include/asm-x86/pci.h
+++ b/xen/include/asm-x86/pci.h
@@ -13,6 +13,12 @@
 
 struct arch_pci_dev {
     vmask_t used_vectors;
+    /*
+     * These fields are (de)initialized under pcidevs-lock. Other uses of
+     * them don't race (de)initialization and hence don't strictly need any
+     * locking.
+     */
+    domid_t pseudo_domid;
 };
 
 int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
@@ -36,6 +42,6 @@ static always_inline bool is_pci_passthr
     return true;
 }
 
-static inline void arch_pci_init_pdev(struct pci_dev *pdev) {}
+void arch_pci_init_pdev(struct pci_dev *pdev);
 
 #endif /* __X86_PCI_H__ */
--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -96,6 +96,7 @@ struct amd_iommu {
     struct ring_buffer cmd_buffer;
     struct ring_buffer event_log;
     struct ring_buffer ppr_log;
+    unsigned long *domid_map;
 
     int exclusion_enable;
     int exclusion_allow_all;
--- a/xen/drivers/passthrough/amd/iommu_detect.c
+++ b/xen/drivers/passthrough/amd/iommu_detect.c
@@ -223,6 +223,11 @@ int __init amd_iommu_detect_one_acpi(
     if ( rt )
         goto out;
 
+    iommu->domid_map = iommu_init_domid();
+    rt = -ENOMEM;
+    if ( !iommu->domid_map )
+        goto out;
+
     rt = pci_ro_device(iommu->seg, bus, PCI_DEVFN(dev, func));
     if ( rt )
         printk(XENLOG_ERR "Could not mark config space of %pp read-only (%d)\n",
@@ -233,7 +238,10 @@ int __init amd_iommu_detect_one_acpi(
 
  out:
     if ( rt )
+    {
+        xfree(iommu->domid_map);
         xfree(iommu);
+    }
 
     return rt;
 }
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -539,6 +539,8 @@ static int amd_iommu_add_device(u8 devfn
     struct amd_iommu *iommu;
     u16 bdf;
     struct ivrs_mappings *ivrs_mappings;
+    bool fresh_domid = false;
+    int ret;
 
     if ( !pdev->domain )
         return -EINVAL;
@@ -606,7 +608,22 @@ static int amd_iommu_add_device(u8 devfn
         AMD_IOMMU_WARN("%pd: unity mapping failed for %pp\n",
                        pdev->domain, &pdev->sbdf);
 
-    return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
+    if ( iommu_quarantine && pdev->arch.pseudo_domid == DOMID_INVALID )
+    {
+        pdev->arch.pseudo_domid = iommu_alloc_domid(iommu->domid_map);
+        if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+            return -ENOSPC;
+        fresh_domid = true;
+    }
+
+    ret = amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
+    if ( ret && fresh_domid )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
+    return ret;
 }
 
 static int amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
@@ -638,6 +655,9 @@ static int amd_iommu_remove_device(u8 de
         AMD_IOMMU_WARN("%pd: unity unmapping failed for %pp\n",
                        pdev->domain, &pdev->sbdf);
 
+    iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
+    pdev->arch.pseudo_domid = DOMID_INVALID;
+
     if ( amd_iommu_perdev_intremap &&
          ivrs_mappings[bdf].dte_requestor_id == bdf &&
          ivrs_mappings[bdf].intremap_table )
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1271,9 +1271,14 @@ static int _dump_pci_devices(struct pci_
 
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
     {
-        printk("%pp - %pd - node %-3d",
-               &pdev->sbdf, pdev->domain,
-               (pdev->node != NUMA_NO_NODE) ? pdev->node : -1);
+        printk("%pp - ", &pdev->sbdf);
+#ifdef CONFIG_X86
+        if ( pdev->domain == dom_io )
+            printk("DomIO:%x", pdev->arch.pseudo_domid);
+        else
+#endif
+            printk("%pd", pdev->domain);
+        printk(" - node %-3d", (pdev->node != NUMA_NO_NODE) ? pdev->node : -1);
         pdev_dump_msi(pdev);
         printk("\n");
     }
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -22,6 +22,7 @@
 #include <xen/sched.h>
 #include <xen/xmalloc.h>
 #include <xen/domain_page.h>
+#include <xen/err.h>
 #include <xen/iocap.h>
 #include <xen/iommu.h>
 #include <xen/numa.h>
@@ -1215,7 +1216,7 @@ int __init iommu_alloc(struct acpi_drhd_
 {
     struct vtd_iommu *iommu;
     unsigned long sagaw, nr_dom;
-    int agaw;
+    int agaw, rc;
 
     iommu = xzalloc(struct vtd_iommu);
     if ( iommu == NULL )
@@ -1301,7 +1302,16 @@ int __init iommu_alloc(struct acpi_drhd_
     if ( !iommu->domid_map )
         return -ENOMEM;
 
+    iommu->pseudo_domid_map = iommu_init_domid();
+    rc = -ENOMEM;
+    if ( !iommu->pseudo_domid_map )
+        goto free;
+
     return 0;
+
+ free:
+    iommu_free(drhd);
+    return rc;
 }
 
 void __init iommu_free(struct acpi_drhd_unit *drhd)
@@ -1324,6 +1334,7 @@ void __init iommu_free(struct acpi_drhd_
 
     xfree(iommu->domid_bitmap);
     xfree(iommu->domid_map);
+    xfree(iommu->pseudo_domid_map);
 
     if ( iommu->msi.irq >= 0 )
         destroy_irq(iommu->msi.irq);
@@ -1593,8 +1604,8 @@ int domain_context_mapping_one(
     return rc ?: pdev && prev_dom;
 }
 
-static int domain_context_unmap(struct domain *d, uint8_t devfn,
-                                struct pci_dev *pdev);
+static const struct acpi_drhd_unit *domain_context_unmap(
+    struct domain *d, uint8_t devfn, struct pci_dev *pdev);
 
 static int domain_context_mapping(struct domain *domain, u8 devfn,
                                   struct pci_dev *pdev)
@@ -1602,6 +1613,7 @@ static int domain_context_mapping(struct
     const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
     const struct acpi_rmrr_unit *rmrr;
     paddr_t pgd_maddr = dom_iommu(domain)->arch.vtd.pgd_maddr;
+    domid_t orig_domid = pdev->arch.pseudo_domid;
     int ret = 0;
     unsigned int i, mode = 0;
     uint16_t seg = pdev->seg, bdf;
@@ -1660,6 +1672,14 @@ static int domain_context_mapping(struct
         if ( !drhd )
             return -ENODEV;
 
+        if ( iommu_quarantine && orig_domid == DOMID_INVALID )
+        {
+            pdev->arch.pseudo_domid =
+                iommu_alloc_domid(drhd->iommu->pseudo_domid_map);
+            if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+                return -ENOSPC;
+        }
+
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCIe: map %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
@@ -1677,6 +1697,14 @@ static int domain_context_mapping(struct
         if ( !drhd )
             return -ENODEV;
 
+        if ( iommu_quarantine && orig_domid == DOMID_INVALID )
+        {
+            pdev->arch.pseudo_domid =
+                iommu_alloc_domid(drhd->iommu->pseudo_domid_map);
+            if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+                return -ENOSPC;
+        }
+
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCI: map %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
@@ -1750,6 +1778,13 @@ static int domain_context_mapping(struct
     if ( !ret && devfn == pdev->devfn )
         pci_vtd_quirk(pdev);
 
+    if ( ret && drhd && orig_domid == DOMID_INVALID )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid,
+                         drhd->iommu->pseudo_domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
     return ret;
 }
 
@@ -1835,8 +1870,10 @@ int domain_context_unmap_one(
     return rc;
 }
 
-static int domain_context_unmap(struct domain *domain, u8 devfn,
-                                struct pci_dev *pdev)
+static const struct acpi_drhd_unit *domain_context_unmap(
+    struct domain *domain,
+    uint8_t devfn,
+    struct pci_dev *pdev)
 {
     const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
     struct vtd_iommu *iommu = drhd ? drhd->iommu : NULL;
@@ -1850,16 +1887,16 @@ static int domain_context_unmap(struct d
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:Hostbridge: skip %pp unmap\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
-        return is_hardware_domain(domain) ? 0 : -EPERM;
+        return ERR_PTR(is_hardware_domain(domain) ? 0 : -EPERM);
 
     case DEV_TYPE_PCIe_BRIDGE:
     case DEV_TYPE_PCIe2PCI_BRIDGE:
     case DEV_TYPE_LEGACY_PCI_BRIDGE:
-        return 0;
+        return ERR_PTR(0);
 
     case DEV_TYPE_PCIe_ENDPOINT:
         if ( !iommu )
-            return -ENODEV;
+            return ERR_PTR(-ENODEV);
 
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCIe: unmap %pp\n",
@@ -1873,7 +1910,7 @@ static int domain_context_unmap(struct d
 
     case DEV_TYPE_PCI:
         if ( !iommu )
-            return -ENODEV;
+            return ERR_PTR(-ENODEV);
 
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCI: unmap %pp\n",
@@ -1920,14 +1957,14 @@ static int domain_context_unmap(struct d
     default:
         dprintk(XENLOG_ERR VTDPREFIX, "%pd:unknown(%u): %pp\n",
                 domain, pdev->type, &PCI_SBDF3(seg, bus, devfn));
-        return -EINVAL;
+        return ERR_PTR(-EINVAL);
     }
 
     if ( !ret && pdev->devfn == devfn &&
          !QUARANTINE_SKIP(domain, dom_iommu(domain)->arch.vtd.pgd_maddr) )
         check_cleanup_domid_map(domain, pdev, iommu);
 
-    return ret;
+    return drhd;
 }
 
 static void iommu_clear_root_pgtable(struct domain *d)
@@ -2154,16 +2191,17 @@ static int intel_iommu_enable_device(str
 
 static int intel_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
 {
+    const struct acpi_drhd_unit *drhd;
     struct acpi_rmrr_unit *rmrr;
     u16 bdf;
-    int ret, i;
+    unsigned int i;
 
     if ( !pdev->domain )
         return -EINVAL;
 
-    ret = domain_context_unmap(pdev->domain, devfn, pdev);
-    if ( ret )
-        return ret;
+    drhd = domain_context_unmap(pdev->domain, devfn, pdev);
+    if ( IS_ERR(drhd) )
+        return PTR_ERR(drhd);
 
     for_each_rmrr_device ( rmrr, bdf, i )
     {
@@ -2180,6 +2218,13 @@ static int intel_iommu_remove_device(u8
                                rmrr->end_address, 0);
     }
 
+    if ( drhd )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid,
+                         drhd->iommu->pseudo_domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
     return 0;
 }
 
@@ -2556,7 +2601,12 @@ static int reassign_device_ownership(
         }
     }
     else
-        ret = domain_context_unmap(source, devfn, pdev);
+    {
+        const struct acpi_drhd_unit *drhd;
+
+        drhd = domain_context_unmap(source, devfn, pdev);
+        ret = IS_ERR(drhd) ? PTR_ERR(drhd) : 0;
+    }
     if ( ret )
     {
         if ( !has_arch_pdevs(target) )
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -508,6 +508,7 @@ struct vtd_iommu {
     } flush;
 
     struct list_head ats_devices;
+    unsigned long *pseudo_domid_map; /* "pseudo" domain id bitmap */
     unsigned long *domid_bitmap;  /* domain id bitmap */
     u16 *domid_map;               /* domain id mapping array */
     uint32_t version;
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -387,6 +387,58 @@ void __hwdom_init arch_iommu_hwdom_init(
         return;
 }
 
+void arch_pci_init_pdev(struct pci_dev *pdev)
+{
+    pdev->arch.pseudo_domid = DOMID_INVALID;
+}
+
+unsigned long *__init iommu_init_domid(void)
+{
+    if ( !iommu_quarantine )
+        return ZERO_BLOCK_PTR;
+
+    BUILD_BUG_ON(DOMID_MASK * 2U >= UINT16_MAX);
+
+    return xzalloc_array(unsigned long,
+                         BITS_TO_LONGS(UINT16_MAX - DOMID_MASK));
+}
+
+domid_t iommu_alloc_domid(unsigned long *map)
+{
+    /*
+     * This is used uniformly across all IOMMUs, such that on typical
+     * systems we wouldn't re-use the same ID very quickly (perhaps never).
+     */
+    static unsigned int start;
+    unsigned int idx = find_next_zero_bit(map, UINT16_MAX - DOMID_MASK, start);
+
+    ASSERT(pcidevs_locked());
+
+    if ( idx >= UINT16_MAX - DOMID_MASK )
+        idx = find_first_zero_bit(map, UINT16_MAX - DOMID_MASK);
+    if ( idx >= UINT16_MAX - DOMID_MASK )
+        return DOMID_INVALID;
+
+    __set_bit(idx, map);
+
+    start = idx + 1;
+
+    return idx | (DOMID_MASK + 1);
+}
+
+void iommu_free_domid(domid_t domid, unsigned long *map)
+{
+    ASSERT(pcidevs_locked());
+
+    if ( domid == DOMID_INVALID )
+        return;
+
+    ASSERT(domid > DOMID_MASK);
+
+    if ( !__test_and_clear_bit(domid & DOMID_MASK, map) )
+        BUG();
+}
+
 int iommu_free_pgtables(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: drop TLB flushes from quarantine_init() hooks

The page tables just created aren't hooked up anywhere yet, so nothing
referring to them can be present in any TLB, and hence there is nothing
to flush. Dropping this flush is, at least on the VT-d side, a
prerequisite for per-device domain ID use when quarantining devices, as
dom_io isn't going to be assigned a DID anymore: the warning in
get_iommu_did() would otherwise trigger.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -654,8 +654,6 @@ int __init amd_iommu_quarantine_init(str
  out:
     spin_unlock(&hd->arch.mapping_lock);
 
-    amd_iommu_flush_all_pages(d);
-
     /* Pages leaked in failure case */
     return level ? -ENOMEM : 0;
 }
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2975,9 +2975,6 @@ static int __init intel_iommu_quarantine
  out:
     spin_unlock(&hd->arch.mapping_lock);
 
-    if ( !rc )
-        rc = iommu_flush_iotlb_all(d);
-
     /* Pages may be leaked in failure case */
     return rc;
 }
From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: abstract maximum number of page table levels

We will want to use the constant elsewhere.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>

--- a/xen/drivers/passthrough/amd/iommu-defs.h
+++ b/xen/drivers/passthrough/amd/iommu-defs.h
@@ -106,6 +106,7 @@ struct amd_iommu_dte {
     bool tv:1;
     unsigned int :5;
     unsigned int had:2;
+#define IOMMU_MAX_PT_LEVELS 6
     unsigned int paging_mode:3;
     uint64_t pt_root:40;
     bool ppr:1;
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -337,7 +337,7 @@ int amd_iommu_alloc_root(struct domain *
     return 0;
 }
 
-unsigned int __read_mostly amd_iommu_max_paging_mode = 6;
+unsigned int __read_mostly amd_iommu_max_paging_mode = IOMMU_MAX_PT_LEVELS;
 int __read_mostly amd_iommu_min_paging_mode = 1;
 
 static int amd_iommu_domain_init(struct domain *d)
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: use per-device page tables for quarantining

Because it is unspecified how and when the memory regions covered by
RMRRs / unity maps may be accessed, devices with such regions may not
be left disconnected from the mappings of these regions (as long as it
isn't certain that the device has been fully quiesced). Hence even the
page tables used when quarantining such devices need to contain
mappings of those regions. This implies installing page tables in the
first place even when not in scratch-page quarantining mode.

This is CVE-2022-26361 / part of XSA-400.

While for the purpose here it would be sufficient to have devices with
RMRRs / unity mapped regions use per-device page tables, extend this to
all devices (in scratch-page quarantining mode). This allows the leaf
pages to be mapped r/w, thus also covering memory writes (rather than
just reads) issued by non-quiescent devices.

Set up quarantine page tables as late as possible, yet early enough to
not encounter failure during de-assign. This means setup generally
happens in assign_device(), while (for now) the one in deassign_device()
is there mainly to be on the safe side.

As to the removal of QUARANTINE_SKIP() from domain_context_unmap_one():
I think this was never really needed there, as the function explicitly
deals with finding a non-present context entry. Leaving it there would
require propagating pgd_maddr into the function (as was done by "VT-d:
prepare for per-device quarantine page tables" for
domain_context_mapping_one()).

In VT-d's DID allocation function, don't require the IOMMU lock to be
held anymore: all involved code paths hold pcidevs_lock, so this way we
avoid the need to acquire the IOMMU lock around the new call to
context_set_domain_id().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
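
As an editorial illustration (not part of the patch set, and not Xen
code), the shape of the page tables built for scratch-page quarantining,
as fill_qpt() below does for AMD, can be sketched in a few lines of
self-contained C: every non-present slot at every level is pointed at a
single shared table for the level underneath, and at the bottom level
everything funnels into one shared, writable sink page. The names
(struct table, fill(), SLOTS, LEVELS) are invented here; real entry
formats, locking and flushing are deliberately left out.

    #include <stdio.h>
    #include <stdlib.h>

    #define SLOTS  8        /* entries per table, tiny for illustration */
    #define LEVELS 3        /* page-table levels in this toy model */

    struct table { struct table *slot[SLOTS]; };

    /* shared[l] is the one table (the sink page at l == 0) reused for level l */
    static void fill(struct table *t, int level, struct table *shared[])
    {
        for (int i = 0; i < SLOTS; ++i) {
            if (t->slot[i]) {
                if (level)          /* fill holes in existing subtrees too */
                    fill(t->slot[i], level - 1, shared);
                continue;
            }
            if (!shared[level]) {
                shared[level] = calloc(1, sizeof(*shared[level]));
                if (!shared[level])
                    exit(EXIT_FAILURE);
                if (level)          /* build the shared chain downwards */
                    fill(shared[level], level - 1, shared);
            }
            t->slot[i] = shared[level];
        }
    }

    int main(void)
    {
        struct table *root = calloc(1, sizeof(*root));
        struct table *shared[LEVELS] = { 0 };

        if (!root)
            return EXIT_FAILURE;

        fill(root, LEVELS - 1, shared);
        printf("slot 0 and slot 1 share one subtree: %s\n",
               root->slot[0] == root->slot[1] ? "yes" : "no");
        return 0;
    }

Only one extra table per level plus a single leaf page is needed, so the
per-device cost stays small even though each quarantined device gets its
own root table.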

--- a/xen/include/asm-x86/pci.h
+++ b/xen/include/asm-x86/pci.h
@@ -1,6 +1,8 @@
 #ifndef __X86_PCI_H__
 #define __X86_PCI_H__
 
+#include <xen/mm.h>
+
 #define CF8_BDF(cf8)     (  ((cf8) & 0x00ffff00) >> 8)
 #define CF8_ADDR_LO(cf8) (   (cf8) & 0x000000fc)
 #define CF8_ADDR_HI(cf8) (  ((cf8) & 0x0f000000) >> 16)
@@ -18,7 +20,18 @@ struct arch_pci_dev {
      * them don't race (de)initialization and hence don't strictly need any
      * locking.
      */
+    union {
+        /* Subset of struct arch_iommu's fields, to be used in dom_io. */
+        struct {
+            uint64_t pgd_maddr;
+        } vtd;
+        struct {
+            struct page_info *root_table;
+        } amd;
+    };
     domid_t pseudo_domid;
+    mfn_t leaf_mfn;
+    struct page_list_head pgtables_list;
 };
 
 int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -237,7 +237,8 @@ int amd_iommu_init_late(void);
 int amd_iommu_update_ivrs_mapping_acpi(void);
 int iov_adjust_irq_affinities(void);
 
-int amd_iommu_quarantine_init(struct domain *d);
+int amd_iommu_quarantine_init(struct pci_dev *pdev, bool scratch_page);
+void amd_iommu_quarantine_teardown(struct pci_dev *pdev);
 
 /* mapping functions */
 int __must_check amd_iommu_map_page(struct domain *d, dfn_t dfn,
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -598,64 +598,138 @@ int amd_iommu_get_reserved_device_memory
     return 0;
 }
 
-int __init amd_iommu_quarantine_init(struct domain *d)
+static int fill_qpt(union amd_iommu_pte *this, unsigned int level,
+                    struct page_info *pgs[IOMMU_MAX_PT_LEVELS])
 {
-    struct domain_iommu *hd = dom_iommu(d);
+    struct domain_iommu *hd = dom_iommu(dom_io);
+    unsigned int i;
+    int rc = 0;
+
+    for ( i = 0; !rc && i < PTE_PER_TABLE_SIZE; ++i )
+    {
+        union amd_iommu_pte *pte = &this[i], *next;
+
+        if ( !pte->pr )
+        {
+            if ( !pgs[level] )
+            {
+                /*
+                 * The pgtable allocator is fine for the leaf page, as well as
+                 * page table pages, and the resulting allocations are always
+                 * zeroed.
+                 */
+                pgs[level] = iommu_alloc_pgtable(hd);
+                if ( !pgs[level] )
+                {
+                    rc = -ENOMEM;
+                    break;
+                }
+
+                if ( level )
+                {
+                    next = __map_domain_page(pgs[level]);
+                    rc = fill_qpt(next, level - 1, pgs);
+                    unmap_domain_page(next);
+                }
+            }
+
+            /*
+             * PDEs are essentially a subset of PTEs, so this function
+             * is fine to use even at the leaf.
+             */
+            set_iommu_pde_present(pte, mfn_x(page_to_mfn(pgs[level])), level,
+                                  true, true);
+        }
+        else if ( level && pte->next_level )
+        {
+            next = map_domain_page(_mfn(pte->mfn));
+            rc = fill_qpt(next, level - 1, pgs);
+            unmap_domain_page(next);
+        }
+    }
+
+    return rc;
+}
+
+int amd_iommu_quarantine_init(struct pci_dev *pdev, bool scratch_page)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
     unsigned long end_gfn =
         1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT);
     unsigned int level = amd_iommu_get_paging_mode(end_gfn);
-    union amd_iommu_pte *table;
+    unsigned int req_id = get_dma_requestor_id(pdev->seg, pdev->sbdf.bdf);
+    const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
+    int rc;
+
+    ASSERT(pcidevs_locked());
+    ASSERT(!hd->arch.amd.root_table);
+    ASSERT(page_list_empty(&hd->arch.pgtables.list));
 
-    if ( hd->arch.amd.root_table )
-    {
-        ASSERT_UNREACHABLE();
+    if ( !scratch_page && !ivrs_mappings[req_id].unity_map )
         return 0;
-    }
 
-    spin_lock(&hd->arch.mapping_lock);
+    ASSERT(pdev->arch.pseudo_domid != DOMID_INVALID);
 
-    hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
-    if ( !hd->arch.amd.root_table )
-        goto out;
+    if ( pdev->arch.amd.root_table )
+    {
+        clear_domain_page(pdev->arch.leaf_mfn);
+        return 0;
+    }
 
-    table = __map_domain_page(hd->arch.amd.root_table);
-    while ( level )
+    pdev->arch.amd.root_table = iommu_alloc_pgtable(hd);
+    if ( !pdev->arch.amd.root_table )
+        return -ENOMEM;
+
+    /* Transiently install the root into DomIO, for iommu_identity_mapping(). */
+    hd->arch.amd.root_table = pdev->arch.amd.root_table;
+
+    rc = amd_iommu_reserve_domain_unity_map(dom_io,
+                                            ivrs_mappings[req_id].unity_map,
+                                            0);
+
+    iommu_identity_map_teardown(dom_io);
+    hd->arch.amd.root_table = NULL;
+
+    if ( rc )
+        AMD_IOMMU_WARN("%pp: quarantine unity mapping failed\n", &pdev->sbdf);
+    else if ( scratch_page )
     {
-        struct page_info *pg;
-        unsigned int i;
+        union amd_iommu_pte *root;
+        struct page_info *pgs[IOMMU_MAX_PT_LEVELS] = {};
 
-        /*
-         * The pgtable allocator is fine for the leaf page, as well as
-         * page table pages, and the resulting allocations are always
-         * zeroed.
-         */
-        pg = iommu_alloc_pgtable(hd);
-        if ( !pg )
-            break;
+        spin_lock(&hd->arch.mapping_lock);
 
-        for ( i = 0; i < PTE_PER_TABLE_SIZE; i++ )
-        {
-            union amd_iommu_pte *pde = &table[i];
+        root = __map_domain_page(pdev->arch.amd.root_table);
+        rc = fill_qpt(root, level - 1, pgs);
+        unmap_domain_page(root);
 
-            /*
-             * PDEs are essentially a subset of PTEs, so this function
-             * is fine to use even at the leaf.
-             */
-            set_iommu_pde_present(pde, mfn_x(page_to_mfn(pg)), level - 1,
-                                  false, true);
-        }
+        pdev->arch.leaf_mfn = page_to_mfn(pgs[0]);
 
-        unmap_domain_page(table);
-        table = __map_domain_page(pg);
-        level--;
+        spin_unlock(&hd->arch.mapping_lock);
     }
-    unmap_domain_page(table);
 
- out:
-    spin_unlock(&hd->arch.mapping_lock);
+    page_list_move(&pdev->arch.pgtables_list, &hd->arch.pgtables.list);
+
+    if ( rc )
+        amd_iommu_quarantine_teardown(pdev);
+
+    return rc;
+}
+
+void amd_iommu_quarantine_teardown(struct pci_dev *pdev)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
+
+    ASSERT(pcidevs_locked());
+
+    if ( !pdev->arch.amd.root_table )
+        return;
 
-    /* Pages leaked in failure case */
-    return level ? -ENOMEM : 0;
+    ASSERT(page_list_empty(&hd->arch.pgtables.list));
+    page_list_move(&hd->arch.pgtables.list, &pdev->arch.pgtables_list);
+    while ( iommu_free_pgtables(dom_io) == -ERESTART )
+        /* nothing */;
+    pdev->arch.amd.root_table = NULL;
 }
 
 /*
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -26,7 +26,7 @@
 #include "../ats.h"
 
 /* dom_io is used as a sentinel for quarantined devices */
-#define QUARANTINE_SKIP(d) ((d) == dom_io && !dom_iommu(d)->arch.amd.root_table)
+#define QUARANTINE_SKIP(d, p) ((d) == dom_io && !(p)->arch.amd.root_table)
 
 static bool_t __read_mostly init_done;
 
@@ -125,8 +125,10 @@ static int __must_check amd_iommu_setup_
     u8 bus = pdev->bus;
     struct domain_iommu *hd = dom_iommu(domain);
     const struct ivrs_mappings *ivrs_dev;
+    const struct page_info *root_pg;
+    domid_t domid;
 
-    if ( QUARANTINE_SKIP(domain) )
+    if ( QUARANTINE_SKIP(domain, pdev) )
         return 0;
 
     BUG_ON(!hd->arch.amd.paging_mode || !iommu->dev_table.buffer);
@@ -147,14 +149,25 @@ static int __must_check amd_iommu_setup_
     dte = &table[req_id];
     ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
 
+    if ( domain != dom_io )
+    {
+        root_pg = hd->arch.amd.root_table;
+        domid = domain->domain_id;
+    }
+    else
+    {
+        root_pg = pdev->arch.amd.root_table;
+        domid = pdev->arch.pseudo_domid;
+    }
+
     spin_lock_irqsave(&iommu->lock, flags);
 
     if ( !dte->v || !dte->tv )
     {
         /* bind DTE to domain page-tables */
         rc = amd_iommu_set_root_page_table(
-                 dte, page_to_maddr(hd->arch.amd.root_table),
-                 domain->domain_id, hd->arch.amd.paging_mode, sr_flags);
+                 dte, page_to_maddr(root_pg), domid,
+                 hd->arch.amd.paging_mode, sr_flags);
         if ( rc )
         {
             ASSERT(rc < 0);
@@ -181,7 +194,7 @@ static int __must_check amd_iommu_setup_
 
         amd_iommu_flush_device(iommu, req_id);
     }
-    else if ( dte->pt_root != mfn_x(page_to_mfn(hd->arch.amd.root_table)) )
+    else if ( dte->pt_root != mfn_x(page_to_mfn(root_pg)) )
     {
         /*
          * Strictly speaking if the device is the only one with this requestor
@@ -194,8 +207,8 @@ static int __must_check amd_iommu_setup_
             rc = -EOPNOTSUPP;
         else
             rc = amd_iommu_set_root_page_table(
-                     dte, page_to_maddr(hd->arch.amd.root_table),
-                     domain->domain_id, hd->arch.amd.paging_mode, sr_flags);
+                     dte, page_to_maddr(root_pg), domid,
+                     hd->arch.amd.paging_mode, sr_flags);
         if ( rc < 0 )
         {
             spin_unlock_irqrestore(&iommu->lock, flags);
@@ -214,6 +227,7 @@ static int __must_check amd_iommu_setup_
               * intended anyway.
               */
              !pdev->domain->is_dying &&
+             pdev->domain != dom_io &&
              (any_pdev_behind_iommu(pdev->domain, pdev, iommu) ||
               pdev->phantom_stride) )
             AMD_IOMMU_WARN(" %pp: reassignment may cause %pd data corruption\n",
@@ -246,9 +260,8 @@ static int __must_check amd_iommu_setup_
     AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
                     "root table = %#"PRIx64", "
                     "domain = %d, paging mode = %d\n",
-                    req_id, pdev->type,
-                    page_to_maddr(hd->arch.amd.root_table),
-                    domain->domain_id, hd->arch.amd.paging_mode);
+                    req_id, pdev->type, page_to_maddr(root_pg),
+                    domid, hd->arch.amd.paging_mode);
 
     ASSERT(pcidevs_locked());
 
@@ -327,7 +340,7 @@ int amd_iommu_alloc_root(struct domain *
 {
     struct domain_iommu *hd = dom_iommu(d);
 
-    if ( unlikely(!hd->arch.amd.root_table) )
+    if ( unlikely(!hd->arch.amd.root_table) && d != dom_io )
     {
         hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
         if ( !hd->arch.amd.root_table )
@@ -391,7 +404,7 @@ static void amd_iommu_disable_domain_dev
     int req_id;
     u8 bus = pdev->bus;
 
-    if ( QUARANTINE_SKIP(domain) )
+    if ( QUARANTINE_SKIP(domain, pdev) )
         return;
 
     ASSERT(pcidevs_locked());
@@ -430,7 +443,7 @@ static void amd_iommu_disable_domain_dev
 
         AMD_IOMMU_DEBUG("Disable: device id = %#x, "
                         "domain = %d, paging mode = %d\n",
-                        req_id,  domain->domain_id,
+                        req_id, dte->domain_id,
                         dom_iommu(domain)->arch.amd.paging_mode);
     }
     else
@@ -453,7 +466,7 @@ static int reassign_device(struct domain
         return -ENODEV;
     }
 
-    if ( !QUARANTINE_SKIP(target) )
+    if ( !QUARANTINE_SKIP(target, pdev) )
     {
         rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
         if ( rc )
@@ -655,6 +668,8 @@ static int amd_iommu_remove_device(u8 de
         AMD_IOMMU_WARN("%pd: unity unmapping failed for %pp\n",
                        pdev->domain, &pdev->sbdf);
 
+    amd_iommu_quarantine_teardown(pdev);
+
     iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
     pdev->arch.pseudo_domid = DOMID_INVALID;
 
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -443,21 +443,22 @@ int iommu_iotlb_flush_all(struct domain
     return rc;
 }
 
-static int __init iommu_quarantine_init(void)
+int iommu_quarantine_dev_init(device_t *dev)
 {
     const struct domain_iommu *hd = dom_iommu(dom_io);
-    int rc;
 
-    dom_io->options |= XEN_DOMCTL_CDF_iommu;
+    if ( !iommu_quarantine || !hd->platform_ops->quarantine_init )
+        return 0;
 
-    rc = iommu_domain_init(dom_io, 0);
-    if ( rc || iommu_quarantine < IOMMU_quarantine_scratch_page )
-        return rc;
+    return iommu_call(hd->platform_ops, quarantine_init,
+                      dev, iommu_quarantine == IOMMU_quarantine_scratch_page);
+}
 
-    if ( !hd->platform_ops->quarantine_init )
-        return 0;
+static int __init iommu_quarantine_init(void)
+{
+    dom_io->options |= XEN_DOMCTL_CDF_iommu;
 
-    return hd->platform_ops->quarantine_init(dom_io);
+    return iommu_domain_init(dom_io, 0);
 }
 
 int __init iommu_setup(void)
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -852,9 +852,16 @@ static int deassign_device(struct domain
         return -ENODEV;
 
     /* De-assignment from dom_io should de-quarantine the device */
-    target = ((pdev->quarantine || iommu_quarantine) &&
-              pdev->domain != dom_io) ?
-        dom_io : hardware_domain;
+    if ( (pdev->quarantine || iommu_quarantine) && pdev->domain != dom_io )
+    {
+        ret = iommu_quarantine_dev_init(pci_to_dev(pdev));
+        if ( ret )
+           return ret;
+
+        target = dom_io;
+    }
+    else
+        target = hardware_domain;
 
     while ( pdev->phantom_stride )
     {
@@ -1424,6 +1431,13 @@ static int assign_device(struct domain *
     if ( rc )
         goto done;
 
+    if ( pdev->domain != dom_io )
+    {
+        rc = iommu_quarantine_dev_init(pci_to_dev(pdev));
+        if ( rc )
+            goto done;
+    }
+
     pdev->fault.count = 0;
 
     if ( (rc = hd->platform_ops->assign_device(d, devfn, pci_to_dev(pdev), flag)) )
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -45,6 +45,11 @@
 
 /* dom_io is used as a sentinel for quarantined devices */
 #define QUARANTINE_SKIP(d, pgd_maddr) ((d) == dom_io && !(pgd_maddr))
+#define DEVICE_DOMID(d, pdev) ((d) != dom_io ? (d)->domain_id \
+                                             : (pdev)->arch.pseudo_domid)
+#define DEVICE_PGTABLE(d, pdev) ((d) != dom_io \
+                                 ? dom_iommu(d)->arch.vtd.pgd_maddr \
+                                 : (pdev)->arch.vtd.pgd_maddr)
 
 /* Possible unfiltered LAPIC/MSI messages from untrusted sources? */
 bool __read_mostly untrusted_msi;
@@ -88,13 +93,18 @@ static int get_iommu_did(domid_t domid,
 
 #define DID_FIELD_WIDTH 16
 #define DID_HIGH_OFFSET 8
+
+/*
+ * This function may have "context" passed as NULL, to merely obtain a DID
+ * for "domid".
+ */
 static int context_set_domain_id(struct context_entry *context,
                                  domid_t domid, struct vtd_iommu *iommu)
 {
     unsigned long nr_dom, i;
     int found = 0;
 
-    ASSERT(spin_is_locked(&iommu->lock));
+    ASSERT(pcidevs_locked());
 
     nr_dom = cap_ndoms(iommu->cap);
     i = find_first_bit(iommu->domid_bitmap, nr_dom);
@@ -120,8 +130,13 @@ static int context_set_domain_id(struct
     }
 
     set_bit(i, iommu->domid_bitmap);
-    context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
-    context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
+
+    if ( context )
+    {
+        context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
+        context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
+    }
+
     return 0;
 }
 
@@ -171,8 +186,12 @@ static void check_cleanup_domid_map(cons
                                     const struct pci_dev *exclude,
                                     struct vtd_iommu *iommu)
 {
-    bool found = any_pdev_behind_iommu(d, exclude, iommu);
+    bool found;
+
+    if ( d == dom_io )
+        return;
 
+    found = any_pdev_behind_iommu(d, exclude, iommu);
     /*
      * Hidden devices are associated with DomXEN but usable by the hardware
      * domain. Hence they need considering here as well.
@@ -1426,7 +1445,7 @@ int domain_context_mapping_one(
         domid = iommu->domid_map[prev_did];
         if ( domid < DOMID_FIRST_RESERVED )
             prev_dom = rcu_lock_domain_by_id(domid);
-        else if ( domid == DOMID_IO )
+        else if ( pdev ? domid == pdev->arch.pseudo_domid : domid > DOMID_MASK )
             prev_dom = rcu_lock_domain(dom_io);
         if ( !prev_dom )
         {
@@ -1582,15 +1601,12 @@ int domain_context_mapping_one(
     {
         if ( !prev_dom )
             ret = domain_context_unmap_one(domain, iommu, bus, devfn,
-                                           domain->domain_id);
+                                           DEVICE_DOMID(domain, pdev));
         else if ( prev_dom != domain ) /* Avoid infinite recursion. */
-        {
-            hd = dom_iommu(prev_dom);
             ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
-                                             domain->domain_id,
-                                             hd->arch.vtd.pgd_maddr,
+                                             DEVICE_DOMID(prev_dom, pdev),
+                                             DEVICE_PGTABLE(prev_dom, pdev),
                                              mode & MAP_WITH_RMRR) < 0;
-        }
         else
             ret = 1;
 
@@ -1612,7 +1628,7 @@ static int domain_context_mapping(struct
 {
     const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
     const struct acpi_rmrr_unit *rmrr;
-    paddr_t pgd_maddr = dom_iommu(domain)->arch.vtd.pgd_maddr;
+    paddr_t pgd_maddr = DEVICE_PGTABLE(domain, pdev);
     domid_t orig_domid = pdev->arch.pseudo_domid;
     int ret = 0;
     unsigned int i, mode = 0;
@@ -1641,7 +1657,7 @@ static int domain_context_mapping(struct
         break;
     }
 
-    if ( domain != pdev->domain )
+    if ( domain != pdev->domain && pdev->domain != dom_io )
     {
         if ( pdev->domain->is_dying )
             mode |= MAP_OWNER_DYING;
@@ -1683,8 +1699,8 @@ static int domain_context_mapping(struct
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCIe: map %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
-        ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, domain->domain_id, pgd_maddr,
+        ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, pdev,
+                                         DEVICE_DOMID(domain, pdev), pgd_maddr,
                                          mode);
         if ( ret > 0 )
             ret = 0;
@@ -1710,8 +1726,8 @@ static int domain_context_mapping(struct
                    domain, &PCI_SBDF3(seg, bus, devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, domain->domain_id, pgd_maddr,
-                                         mode);
+                                         pdev, DEVICE_DOMID(domain, pdev),
+                                         pgd_maddr, mode);
         if ( ret < 0 )
             break;
         prev_present = ret;
@@ -1739,8 +1755,8 @@ static int domain_context_mapping(struct
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                             NULL, domain->domain_id, pgd_maddr,
-                                             mode);
+                                             NULL, DEVICE_DOMID(domain, pdev),
+                                             pgd_maddr, mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1755,8 +1771,8 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL, domain->domain_id, pgd_maddr,
-                                             mode);
+                                             NULL, DEVICE_DOMID(domain, pdev),
+                                             pgd_maddr, mode);
 
         if ( ret )
         {
@@ -1798,9 +1814,6 @@ int domain_context_unmap_one(
     int iommu_domid, rc, ret;
     bool_t flush_dev_iotlb;
 
-    if ( QUARANTINE_SKIP(domain, dom_iommu(domain)->arch.vtd.pgd_maddr) )
-        return 0;
-
     ASSERT(pcidevs_locked());
     spin_lock(&iommu->lock);
 
@@ -1902,7 +1915,7 @@ static const struct acpi_drhd_unit *doma
             printk(VTDPREFIX "%pd:PCIe: unmap %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
         ret = domain_context_unmap_one(domain, iommu, bus, devfn,
-                                       domain->domain_id);
+                                       DEVICE_DOMID(domain, pdev));
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             disable_ats_device(pdev);
 
@@ -1916,7 +1929,7 @@ static const struct acpi_drhd_unit *doma
             printk(VTDPREFIX "%pd:PCI: unmap %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
         ret = domain_context_unmap_one(domain, iommu, bus, devfn,
-                                       domain->domain_id);
+                                       DEVICE_DOMID(domain, pdev));
         if ( ret )
             break;
 
@@ -1939,18 +1952,12 @@ static const struct acpi_drhd_unit *doma
             break;
         }
 
+        ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                       DEVICE_DOMID(domain, pdev));
         /* PCIe to PCI/PCIx bridge */
-        if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
-        {
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
-                                           domain->domain_id);
-            if ( !ret )
-                ret = domain_context_unmap_one(domain, iommu, secbus, 0,
-                                               domain->domain_id);
-        }
-        else /* Legacy PCI bridge */
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
-                                           domain->domain_id);
+        if ( !ret && pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
+            ret = domain_context_unmap_one(domain, iommu, secbus, 0,
+                                           DEVICE_DOMID(domain, pdev));
 
         break;
 
@@ -1961,7 +1968,7 @@ static const struct acpi_drhd_unit *doma
     }
 
     if ( !ret && pdev->devfn == devfn &&
-         !QUARANTINE_SKIP(domain, dom_iommu(domain)->arch.vtd.pgd_maddr) )
+         !QUARANTINE_SKIP(domain, pdev->arch.vtd.pgd_maddr) )
         check_cleanup_domid_map(domain, pdev, iommu);
 
     return drhd;
@@ -1994,6 +2001,26 @@ static void iommu_domain_teardown(struct
     XFREE(hd->arch.vtd.iommu_bitmap);
 }
 
+static void quarantine_teardown(struct pci_dev *pdev,
+                                const struct acpi_drhd_unit *drhd)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
+
+    ASSERT(pcidevs_locked());
+
+    if ( !pdev->arch.vtd.pgd_maddr )
+        return;
+
+    ASSERT(page_list_empty(&hd->arch.pgtables.list));
+    page_list_move(&hd->arch.pgtables.list, &pdev->arch.pgtables_list);
+    while ( iommu_free_pgtables(dom_io) == -ERESTART )
+        /* nothing */;
+    pdev->arch.vtd.pgd_maddr = 0;
+
+    if ( drhd )
+        cleanup_domid_map(pdev->arch.pseudo_domid, drhd->iommu);
+}
+
 static int __must_check intel_iommu_map_page(struct domain *d, dfn_t dfn,
                                              mfn_t mfn, unsigned int flags,
                                              unsigned int *flush_flags)
@@ -2218,6 +2245,8 @@ static int intel_iommu_remove_device(u8
                                rmrr->end_address, 0);
     }
 
+    quarantine_teardown(pdev, drhd);
+
     if ( drhd )
     {
         iommu_free_domid(pdev->arch.pseudo_domid,
@@ -2576,7 +2605,7 @@ static int reassign_device_ownership(
 {
     int ret;
 
-    if ( !QUARANTINE_SKIP(target, dom_iommu(target)->arch.vtd.pgd_maddr) )
+    if ( !QUARANTINE_SKIP(target, pdev->arch.vtd.pgd_maddr) )
     {
         if ( !has_arch_pdevs(target) )
             vmx_pi_hooks_assign(target);
@@ -2592,7 +2621,7 @@ static int reassign_device_ownership(
         ret = domain_context_mapping(target, devfn, pdev);
 
         if ( !ret && pdev->devfn == devfn &&
-             !QUARANTINE_SKIP(source, dom_iommu(source)->arch.vtd.pgd_maddr) )
+             !QUARANTINE_SKIP(source, pdev->arch.vtd.pgd_maddr) )
         {
             const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
 
@@ -2913,69 +2942,135 @@ static void vtd_dump_page_tables(struct
                               agaw_to_level(hd->arch.vtd.agaw), 0, 0);
 }
 
-static int __init intel_iommu_quarantine_init(struct domain *d)
+static int fill_qpt(struct dma_pte *this, unsigned int level,
+                    struct page_info *pgs[6])
 {
-    struct domain_iommu *hd = dom_iommu(d);
+    struct domain_iommu *hd = dom_iommu(dom_io);
+    unsigned int i;
+    int rc = 0;
+
+    for ( i = 0; !rc && i < PTE_NUM; ++i )
+    {
+        struct dma_pte *pte = &this[i], *next;
+
+        if ( !dma_pte_present(*pte) )
+        {
+            if ( !pgs[level] )
+            {
+                /*
+                 * The pgtable allocator is fine for the leaf page, as well as
+                 * page table pages, and the resulting allocations are always
+                 * zeroed.
+                 */
+                pgs[level] = iommu_alloc_pgtable(hd);
+                if ( !pgs[level] )
+                {
+                    rc = -ENOMEM;
+                    break;
+                }
+
+                if ( level )
+                {
+                    next = map_vtd_domain_page(page_to_maddr(pgs[level]));
+                    rc = fill_qpt(next, level - 1, pgs);
+                    unmap_vtd_domain_page(next);
+                }
+            }
+
+            dma_set_pte_addr(*pte, page_to_maddr(pgs[level]));
+            dma_set_pte_readable(*pte);
+            dma_set_pte_writable(*pte);
+        }
+        else if ( level && !dma_pte_superpage(*pte) )
+        {
+            next = map_vtd_domain_page(dma_pte_addr(*pte));
+            rc = fill_qpt(next, level - 1, pgs);
+            unmap_vtd_domain_page(next);
+        }
+    }
+
+    return rc;
+}
+
+static int intel_iommu_quarantine_init(struct pci_dev *pdev, bool scratch_page)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
     struct page_info *pg;
-    struct dma_pte *parent;
     unsigned int agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
     unsigned int level = agaw_to_level(agaw);
-    int rc = 0;
+    const struct acpi_drhd_unit *drhd;
+    const struct acpi_rmrr_unit *rmrr;
+    unsigned int i, bdf;
+    bool rmrr_found = false;
+    int rc;
 
-    spin_lock(&hd->arch.mapping_lock);
+    ASSERT(pcidevs_locked());
+    ASSERT(!hd->arch.vtd.pgd_maddr);
+    ASSERT(page_list_empty(&hd->arch.pgtables.list));
 
-    if ( hd->arch.vtd.pgd_maddr )
+    if ( pdev->arch.vtd.pgd_maddr )
     {
-        ASSERT_UNREACHABLE();
-        goto out;
+        clear_domain_page(pdev->arch.leaf_mfn);
+        return 0;
     }
 
-    pg = iommu_alloc_pgtable(hd);
+    drhd = acpi_find_matched_drhd_unit(pdev);
+    if ( !drhd )
+        return -ENODEV;
 
-    rc = -ENOMEM;
+    pg = iommu_alloc_pgtable(hd);
     if ( !pg )
-        goto out;
+        return -ENOMEM;
 
+    rc = context_set_domain_id(NULL, pdev->arch.pseudo_domid, drhd->iommu);
+
+    /* Transiently install the root into DomIO, for iommu_identity_mapping(). */
     hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
 
-    parent = map_vtd_domain_page(hd->arch.vtd.pgd_maddr);
-    while ( level )
+    for_each_rmrr_device ( rmrr, bdf, i )
     {
-        uint64_t maddr;
-        unsigned int offset;
-
-        /*
-         * The pgtable allocator is fine for the leaf page, as well as
-         * page table pages, and the resulting allocations are always
-         * zeroed.
-         */
-        pg = iommu_alloc_pgtable(hd);
-
-        if ( !pg )
-            goto out;
+        if ( rc )
+            break;
 
-        maddr = page_to_maddr(pg);
-        for ( offset = 0; offset < PTE_NUM; offset++ )
+        if ( rmrr->segment == pdev->seg && bdf == pdev->sbdf.bdf )
         {
-            struct dma_pte *pte = &parent[offset];
+            rmrr_found = true;
 
-            dma_set_pte_addr(*pte, maddr);
-            dma_set_pte_readable(*pte);
+            rc = iommu_identity_mapping(dom_io, p2m_access_rw,
+                                        rmrr->base_address, rmrr->end_address,
+                                        0);
+            if ( rc )
+                printk(XENLOG_ERR VTDPREFIX
+                       "%pp: RMRR quarantine mapping failed\n",
+                       &pdev->sbdf);
         }
-        iommu_sync_cache(parent, PAGE_SIZE);
+    }
 
-        unmap_vtd_domain_page(parent);
-        parent = map_vtd_domain_page(maddr);
-        level--;
+    iommu_identity_map_teardown(dom_io);
+    hd->arch.vtd.pgd_maddr = 0;
+    pdev->arch.vtd.pgd_maddr = page_to_maddr(pg);
+
+    if ( !rc && scratch_page )
+    {
+        struct dma_pte *root;
+        struct page_info *pgs[6] = {};
+
+        spin_lock(&hd->arch.mapping_lock);
+
+        root = map_vtd_domain_page(pdev->arch.vtd.pgd_maddr);
+        rc = fill_qpt(root, level - 1, pgs);
+        unmap_vtd_domain_page(root);
+
+        pdev->arch.leaf_mfn = page_to_mfn(pgs[0]);
+
+        spin_unlock(&hd->arch.mapping_lock);
     }
-    unmap_vtd_domain_page(parent);
 
-    rc = 0;
+    page_list_move(&pdev->arch.pgtables_list, &hd->arch.pgtables.list);
 
- out:
-    spin_unlock(&hd->arch.mapping_lock);
+    if ( rc || (!scratch_page && !rmrr_found) )
+        quarantine_teardown(pdev, drhd);
 
-    /* Pages may be leaked in failure case */
     return rc;
 }
 
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -482,7 +482,7 @@ struct vtd_iommu {
     u32 nr_pt_levels;
     u64	cap;
     u64	ecap;
-    spinlock_t lock; /* protect context, domain ids */
+    spinlock_t lock; /* protect context */
     spinlock_t register_lock; /* protect iommu register handling */
     u64 root_maddr; /* root entry machine address */
     nodeid_t node;
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -233,7 +233,7 @@ typedef int iommu_grdm_t(xen_pfn_t start
 struct iommu_ops {
     int (*init)(struct domain *d);
     void (*hwdom_init)(struct domain *d);
-    int (*quarantine_init)(struct domain *d);
+    int (*quarantine_init)(device_t *dev, bool scratch_page);
     int (*add_device)(u8 devfn, device_t *dev);
     int (*enable_device)(device_t *dev);
     int (*remove_device)(u8 devfn, device_t *dev);
@@ -350,6 +350,7 @@ int __must_check iommu_suspend(void);
 void iommu_resume(void);
 void iommu_crash_shutdown(void);
 int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
+int iommu_quarantine_dev_init(device_t *dev);
 
 #ifdef CONFIG_HAS_PCI
 int iommu_do_pci_domctl(struct xen_domctl *, struct domain *d,
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: drop ownership checking from domain_context_mapping_one()

Despite putting in quite a bit of effort, it was not possible to
establish why exactly this code exists (beyond, possibly, sanity
checking). Rather than have a subsequent change further complicate this
logic, simply get rid of it.

Take the opportunity and move the respective unmap_vtd_domain_page() out
of the locked region.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
v4: New.

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -142,28 +142,6 @@ static int context_set_domain_id(struct
     return 0;
 }
 
-static int context_get_domain_id(const struct context_entry *context,
-                                 const struct vtd_iommu *iommu)
-{
-    int domid = -1;
-
-    if ( iommu && context )
-    {
-        unsigned int dom_index = context_domain_id(*context);
-
-        domid = did_to_domain_id(iommu, dom_index);
-        if ( domid == DOMID_INVALID )
-        {
-            dprintk(XENLOG_DEBUG VTDPREFIX,
-                    "no domid for did %u (nr_dom %u)\n",
-                    dom_index, cap_ndoms(iommu->cap));
-            domid = -1;
-        }
-    }
-
-    return domid;
-}
-
 static void cleanup_domid_map(struct domain *domain, struct vtd_iommu *iommu)
 {
     int iommu_domid;
@@ -1410,44 +1388,9 @@ int domain_context_mapping_one(
 
     if ( context_present(*context) )
     {
-        int res = 0;
-
-        /* Try to get domain ownership from device structure.  If that's
-         * not available, try to read it from the context itself. */
-        if ( pdev )
-        {
-            if ( pdev->domain != domain )
-            {
-                printk(XENLOG_G_INFO VTDPREFIX "%pd: %pp owned by %pd",
-                       domain, &PCI_SBDF3(seg, bus, devfn),
-                       pdev->domain);
-                res = -EINVAL;
-            }
-        }
-        else
-        {
-            int cdomain;
-            cdomain = context_get_domain_id(context, iommu);
-            
-            if ( cdomain < 0 )
-            {
-                printk(XENLOG_G_WARNING VTDPREFIX
-                       "%pd: %pp mapped, but can't find owner\n",
-                       domain, &PCI_SBDF3(seg, bus, devfn));
-                res = -EINVAL;
-            }
-            else if ( cdomain != domain->domain_id )
-            {
-                printk(XENLOG_G_INFO VTDPREFIX
-                       "%pd: %pp already mapped to d%d",
-                       domain, &PCI_SBDF3(seg, bus, devfn), cdomain);
-                res = -EINVAL;
-            }
-        }
-
-        unmap_vtd_domain_page(context_entries);
         spin_unlock(&iommu->lock);
-        return res;
+        unmap_vtd_domain_page(context_entries);
+        return 0;
     }
 
     if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: re-assign devices directly

Since it is unspecified how/when the reserved memory regions of devices
with RMRRs may get accessed, such devices may not be left disconnected
from their respective mappings (as long as it's not certain that the
device has been fully quiesced). Hence, rather than unmapping the old
context and then mapping the new one, re-assignment needs to be done in
a single step.

This is CVE-2022-26359 / part of XSA-400.

Reported-by: Roger Pau Monné <roger.pau@citrix.com>

Similarly, the scratch-page mode of quarantining relies on page tables
remaining continuously wired up.

To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any RMRRs. The
main difference is when it comes to updating context entries, which need
to be atomic when there are RMRRs. Yet atomicity can only be achieved
with CMPXCHG16B, the availability of which we can't take for granted (a
minimal sketch of the pattern follows the revision notes below).

The seemingly complicated choice of non-negative return values for
domain_context_mapping_one() is to limit code churn: this way, callers
passing NULL for pdev don't need to be touched.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
v7: Re-base.
v6: Also include devices with phantom functions in the not-single-device
    check. Shorten BDF comparison in domain_context_mapping().
v5: Refuse re-assignment of PCI devices behind a bridge and with RMRR.
    Shrink adjustment to context_set_domain_id(). Skip actual write in
    domain_context_mapping_one() for "same domain" case. Convert
    asserting of CX16 success to error return. Prefer
    any_pdev_behind_iommu() over list_{first,last}_entry_or_null().
    Refine warning message text and use PCI_SBDF3() there. Use correct
    domain to decide whether to warn. Move #define-s to vtd.h.
v4: Use atomic context entry updates when possible. Re-base over new
    earlier patch.
v3: Unmap on error when the device was previously unmapped.
v2: New.
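
For illustration only (not part of the patch): a minimal sketch of the
16-byte compare-and-swap pattern relied upon above when CMPXCHG16B is
usable, expressed with hypothetical names, GCC/Clang builtins and the
__int128 extension.  The entry is replaced only if it still holds the
value read earlier; a mismatch means it changed behind our backs and
the caller must bail out rather than overwrite it.

    /* Hypothetical sketch, not Xen code; names are made up. */
    #include <stdbool.h>
    #include <stdint.h>

    struct ctx_entry {
        union {
            struct { uint64_t lo, hi; } w;
            unsigned __int128 full;   /* also forces 16-byte alignment */
        };
    };

    /*
     * Swap in a fully formed new entry only if the live entry still
     * holds the value read earlier.
     */
    static bool ctx_entry_replace(struct ctx_entry *live,
                                  unsigned __int128 old,
                                  const struct ctx_entry *val)
    {
        /*
         * With -mcx16 recent GCC/Clang emit CMPXCHG16B for this;
         * otherwise libatomic services the call.  Either way the
         * 128-bit entry is exchanged as a single unit.
         */
        return __atomic_compare_exchange_n(&live->full, &old, val->full,
                                           false, __ATOMIC_SEQ_CST,
                                           __ATOMIC_SEQ_CST);
    }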

--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -81,7 +81,8 @@ void free_pgtable_maddr(u64 maddr);
 void *map_vtd_domain_page(u64 maddr);
 void unmap_vtd_domain_page(const void *va);
 int domain_context_mapping_one(struct domain *domain, struct vtd_iommu *iommu,
-                               u8 bus, u8 devfn, const struct pci_dev *);
+                               uint8_t bus, uint8_t devfn,
+                               const struct pci_dev *pdev, unsigned int mode);
 int domain_context_unmap_one(struct domain *domain, struct vtd_iommu *iommu,
                              u8 bus, u8 devfn);
 int cf_check intel_iommu_get_reserved_device_memory(
@@ -103,8 +104,8 @@ bool is_azalia_tlb_enabled(const struct
 void platform_quirks_init(void);
 void vtd_ops_preamble_quirk(struct vtd_iommu *iommu);
 void vtd_ops_postamble_quirk(struct vtd_iommu *iommu);
-int __must_check me_wifi_quirk(struct domain *domain,
-                               u8 bus, u8 devfn, int map);
+int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus,
+                               uint8_t devfn, unsigned int mode);
 void pci_vtd_quirk(const struct pci_dev *);
 void quirk_iommu_caps(struct vtd_iommu *iommu);
 
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -138,6 +138,7 @@ static int context_set_domain_id(struct
     else
         i = convert_domid(iommu, d->domain_id);
 
+    context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
     context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
     return 0;
 }
@@ -1365,15 +1366,27 @@ static void __hwdom_init cf_check intel_
     }
 }
 
+/*
+ * This function returns
+ * - a negative errno value upon error,
+ * - zero upon success when previously the entry was non-present, or this isn't
+ *   the "main" request for a device (pdev == NULL), or for no-op quarantining
+ *   assignments,
+ * - positive (one) upon success when previously the entry was present and this
+ *   is the "main" request for a device (pdev != NULL).
+ */
 int domain_context_mapping_one(
     struct domain *domain,
     struct vtd_iommu *iommu,
-    u8 bus, u8 devfn, const struct pci_dev *pdev)
+    uint8_t bus, uint8_t devfn, const struct pci_dev *pdev,
+    unsigned int mode)
 {
     struct domain_iommu *hd = dom_iommu(domain);
-    struct context_entry *context, *context_entries;
+    struct context_entry *context, *context_entries, lctxt;
+    __uint128_t old;
     u64 maddr, pgd_maddr;
-    u16 seg = iommu->drhd->segment;
+    uint16_t seg = iommu->drhd->segment, prev_did = 0;
+    struct domain *prev_dom = NULL;
     int rc, ret;
     bool_t flush_dev_iotlb;
 
@@ -1385,17 +1398,32 @@ int domain_context_mapping_one(
     maddr = bus_to_context_maddr(iommu, bus);
     context_entries = (struct context_entry *)map_vtd_domain_page(maddr);
     context = &context_entries[devfn];
+    old = (lctxt = *context).full;
 
-    if ( context_present(*context) )
+    if ( context_present(lctxt) )
     {
-        spin_unlock(&iommu->lock);
-        unmap_vtd_domain_page(context_entries);
-        return 0;
+        domid_t domid;
+
+        prev_did = context_domain_id(lctxt);
+        domid = did_to_domain_id(iommu, prev_did);
+        if ( domid < DOMID_FIRST_RESERVED )
+            prev_dom = rcu_lock_domain_by_id(domid);
+        else if ( domid == DOMID_IO )
+            prev_dom = rcu_lock_domain(dom_io);
+        if ( !prev_dom )
+        {
+            spin_unlock(&iommu->lock);
+            unmap_vtd_domain_page(context_entries);
+            dprintk(XENLOG_DEBUG VTDPREFIX,
+                    "no domain for did %u (nr_dom %u)\n",
+                    prev_did, cap_ndoms(iommu->cap));
+            return -ESRCH;
+        }
     }
 
     if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
     {
-        context_set_translation_type(*context, CONTEXT_TT_PASS_THRU);
+        context_set_translation_type(lctxt, CONTEXT_TT_PASS_THRU);
     }
     else
     {
@@ -1407,37 +1435,108 @@ int domain_context_mapping_one(
             spin_unlock(&hd->arch.mapping_lock);
             spin_unlock(&iommu->lock);
             unmap_vtd_domain_page(context_entries);
+            if ( prev_dom )
+                rcu_unlock_domain(prev_dom);
             return -ENOMEM;
         }
 
-        context_set_address_root(*context, pgd_maddr);
+        context_set_address_root(lctxt, pgd_maddr);
         if ( ats_enabled && ecap_dev_iotlb(iommu->ecap) )
-            context_set_translation_type(*context, CONTEXT_TT_DEV_IOTLB);
+            context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB);
         else
-            context_set_translation_type(*context, CONTEXT_TT_MULTI_LEVEL);
+            context_set_translation_type(lctxt, CONTEXT_TT_MULTI_LEVEL);
 
         spin_unlock(&hd->arch.mapping_lock);
     }
 
-    rc = context_set_domain_id(context, domain, iommu);
+    rc = context_set_domain_id(&lctxt, domain, iommu);
     if ( rc )
     {
+    unlock:
         spin_unlock(&iommu->lock);
         unmap_vtd_domain_page(context_entries);
+        if ( prev_dom )
+            rcu_unlock_domain(prev_dom);
         return rc;
     }
 
-    context_set_address_width(*context, level_to_agaw(iommu->nr_pt_levels));
-    context_set_fault_enable(*context);
-    context_set_present(*context);
+    if ( !prev_dom )
+    {
+        context_set_address_width(lctxt, level_to_agaw(iommu->nr_pt_levels));
+        context_set_fault_enable(lctxt);
+        context_set_present(lctxt);
+    }
+    else if ( prev_dom == domain )
+    {
+        ASSERT(lctxt.full == context->full);
+        rc = !!pdev;
+        goto unlock;
+    }
+    else
+    {
+        ASSERT(context_address_width(lctxt) ==
+               level_to_agaw(iommu->nr_pt_levels));
+        ASSERT(!context_fault_disable(lctxt));
+    }
+
+    if ( cpu_has_cx16 )
+    {
+        __uint128_t res = cmpxchg16b(context, &old, &lctxt.full);
+
+        /*
+         * Hardware does not update the context entry behind our backs,
+         * so the return value should match "old".
+         */
+        if ( res != old )
+        {
+            if ( pdev )
+                check_cleanup_domid_map(domain, pdev, iommu);
+            printk(XENLOG_ERR
+                   "%pp: unexpected context entry %016lx_%016lx (expected %016lx_%016lx)\n",
+                   &PCI_SBDF3(pdev->seg, pdev->bus, devfn),
+                   (uint64_t)(res >> 64), (uint64_t)res,
+                   (uint64_t)(old >> 64), (uint64_t)old);
+            rc = -EILSEQ;
+            goto unlock;
+        }
+    }
+    else if ( !prev_dom || !(mode & MAP_WITH_RMRR) )
+    {
+        context_clear_present(*context);
+        iommu_sync_cache(context, sizeof(*context));
+
+        write_atomic(&context->hi, lctxt.hi);
+        /* No barrier should be needed between these two. */
+        write_atomic(&context->lo, lctxt.lo);
+    }
+    else /* Best effort, updating DID last. */
+    {
+         /*
+          * By non-atomically updating the context entry's DID field last,
+          * during a short window in time TLB entries with the old domain ID
+          * but the new page tables may be inserted.  This could affect I/O
+          * of other devices using this same (old) domain ID.  Such updating
+          * therefore is not a problem if this was the only device associated
+          * with the old domain ID.  Diverting I/O of any of a dying domain's
+          * devices to the quarantine page tables is intended anyway.
+          */
+        if ( !(mode & (MAP_OWNER_DYING | MAP_SINGLE_DEVICE)) )
+            printk(XENLOG_WARNING VTDPREFIX
+                   " %pp: reassignment may cause %pd data corruption\n",
+                   &PCI_SBDF3(seg, bus, devfn), prev_dom);
+
+        write_atomic(&context->lo, lctxt.lo);
+        /* No barrier should be needed between these two. */
+        write_atomic(&context->hi, lctxt.hi);
+    }
+
     iommu_sync_cache(context, sizeof(struct context_entry));
     spin_unlock(&iommu->lock);
 
-    /* Context entry was previously non-present (with domid 0). */
-    rc = iommu_flush_context_device(iommu, 0, PCI_BDF2(bus, devfn),
-                                    DMA_CCMD_MASK_NOBIT, 1);
+    rc = iommu_flush_context_device(iommu, prev_did, PCI_BDF2(bus, devfn),
+                                    DMA_CCMD_MASK_NOBIT, !prev_dom);
     flush_dev_iotlb = !!find_ats_dev_drhd(iommu);
-    ret = iommu_flush_iotlb_dsi(iommu, 0, 1, flush_dev_iotlb);
+    ret = iommu_flush_iotlb_dsi(iommu, prev_did, !prev_dom, flush_dev_iotlb);
 
     /*
      * The current logic for returns:
@@ -1458,17 +1557,26 @@ int domain_context_mapping_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !seg && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, MAP_ME_PHANTOM_FUNC);
+        rc = me_wifi_quirk(domain, bus, devfn, mode);
 
     if ( rc )
     {
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        if ( !prev_dom )
+            ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        else if ( prev_dom != domain ) /* Avoid infinite recursion. */
+            ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
+                                             mode & MAP_WITH_RMRR) < 0;
+        else
+            ret = 1;
 
         if ( !ret && pdev && pdev->devfn == devfn )
             check_cleanup_domid_map(domain, pdev, iommu);
     }
 
-    return rc;
+    if ( prev_dom )
+        rcu_unlock_domain(prev_dom);
+
+    return rc ?: pdev && prev_dom;
 }
 
 static int domain_context_unmap(struct domain *d, uint8_t devfn,
@@ -1478,8 +1586,10 @@ static int domain_context_mapping(struct
                                   struct pci_dev *pdev)
 {
     const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
+    const struct acpi_rmrr_unit *rmrr;
     int ret = 0;
-    uint16_t seg = pdev->seg;
+    unsigned int i, mode = 0;
+    uint16_t seg = pdev->seg, bdf;
     uint8_t bus = pdev->bus, secbus;
 
     /*
@@ -1495,8 +1605,29 @@ static int domain_context_mapping(struct
 
     ASSERT(pcidevs_locked());
 
+    for_each_rmrr_device( rmrr, bdf, i )
+    {
+        if ( rmrr->segment != pdev->seg || bdf != pdev->sbdf.bdf )
+            continue;
+
+        mode |= MAP_WITH_RMRR;
+        break;
+    }
+
+    if ( domain != pdev->domain )
+    {
+        if ( pdev->domain->is_dying )
+            mode |= MAP_OWNER_DYING;
+        else if ( drhd &&
+                  !any_pdev_behind_iommu(pdev->domain, pdev, drhd->iommu) &&
+                  !pdev->phantom_stride )
+            mode |= MAP_SINGLE_DEVICE;
+    }
+
     switch ( pdev->type )
     {
+        bool prev_present;
+
     case DEV_TYPE_PCI_HOST_BRIDGE:
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:Hostbridge: skip %pp map\n",
@@ -1518,7 +1649,9 @@ static int domain_context_mapping(struct
             printk(VTDPREFIX "%pd:PCIe: map %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev);
+                                         pdev, mode);
+        if ( ret > 0 )
+            ret = 0;
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             enable_ats_device(pdev, &drhd->iommu->ats_devices);
 
@@ -1533,9 +1666,10 @@ static int domain_context_mapping(struct
                    domain, &PCI_SBDF3(seg, bus, devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev);
-        if ( ret )
+                                         pdev, mode);
+        if ( ret < 0 )
             break;
+        prev_present = ret;
 
         if ( (ret = find_upstream_bridge(seg, &bus, &devfn, &secbus)) < 1 )
         {
@@ -1543,6 +1677,15 @@ static int domain_context_mapping(struct
                 break;
             ret = -ENXIO;
         }
+        /*
+         * Strictly speaking if the device is the only one behind this bridge
+         * and the only one with this (secbus,0,0) tuple, it could be allowed
+         * to be re-assigned regardless of RMRR presence.  But let's deal with
+         * that case only if it is actually found in the wild.
+         */
+        else if ( prev_present && (mode & MAP_WITH_RMRR) &&
+                  domain != pdev->domain )
+            ret = -EOPNOTSUPP;
 
         /*
          * Mapping a bridge should, if anything, pass the struct pci_dev of
@@ -1551,7 +1694,7 @@ static int domain_context_mapping(struct
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                             NULL);
+                                             NULL, mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1566,10 +1709,15 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL);
+                                             NULL, mode);
 
         if ( ret )
-            domain_context_unmap(domain, devfn, pdev);
+        {
+            if ( !prev_present )
+                domain_context_unmap(domain, devfn, pdev);
+            else if ( pdev->domain != domain ) /* Avoid infinite recursion. */
+                domain_context_mapping(pdev->domain, devfn, pdev);
+        }
 
         break;
 
@@ -2358,9 +2506,47 @@ static int cf_check reassign_device_owne
 {
     int ret;
 
-    ret = domain_context_unmap(source, devfn, pdev);
+    if ( !QUARANTINE_SKIP(target) )
+    {
+        if ( !has_arch_pdevs(target) )
+            vmx_pi_hooks_assign(target);
+
+        /*
+         * Devices assigned to untrusted domains (here assumed to be any domU)
+         * can attempt to send arbitrary LAPIC/MSI messages. We are unprotected
+         * by the root complex unless interrupt remapping is enabled.
+         */
+        if ( !iommu_intremap && !is_hardware_domain(target) &&
+             !is_system_domain(target) )
+            untrusted_msi = true;
+
+        ret = domain_context_mapping(target, devfn, pdev);
+
+        if ( !ret && !QUARANTINE_SKIP(source) && pdev->devfn == devfn )
+        {
+            const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
+
+            if ( drhd )
+                check_cleanup_domid_map(source, pdev, drhd->iommu);
+        }
+    }
+    else
+        ret = domain_context_unmap(source, devfn, pdev);
     if ( ret )
+    {
+        if ( !has_arch_pdevs(target) )
+            vmx_pi_hooks_deassign(target);
         return ret;
+    }
+
+    if ( devfn == pdev->devfn && pdev->domain != target )
+    {
+        list_move(&pdev->domain_list, &target->pdev_list);
+        pdev->domain = target;
+    }
+
+    if ( !has_arch_pdevs(source) )
+        vmx_pi_hooks_deassign(source);
 
     /*
      * If the device belongs to the hardware domain, and it has RMRR, don't
@@ -2390,43 +2576,7 @@ static int cf_check reassign_device_owne
             }
     }
 
-    if ( devfn == pdev->devfn && pdev->domain != dom_io )
-    {
-        list_move(&pdev->domain_list, &dom_io->pdev_list);
-        pdev->domain = dom_io;
-    }
-
-    if ( !has_arch_pdevs(source) )
-        vmx_pi_hooks_deassign(source);
-
-    if ( !has_arch_pdevs(target) )
-        vmx_pi_hooks_assign(target);
-
-    /*
-     * Devices assigned to untrusted domains (here assumed to be any domU)
-     * can attempt to send arbitrary LAPIC/MSI messages. We are unprotected
-     * by the root complex unless interrupt remapping is enabled.
-     */
-    if ( !iommu_intremap && !is_hardware_domain(target) &&
-         !is_system_domain(target) )
-        untrusted_msi = true;
-
-    ret = domain_context_mapping(target, devfn, pdev);
-    if ( ret )
-    {
-        if ( !has_arch_pdevs(target) )
-            vmx_pi_hooks_deassign(target);
-
-        return ret;
-    }
-
-    if ( devfn == pdev->devfn && pdev->domain != target )
-    {
-        list_move(&pdev->domain_list, &target->pdev_list);
-        pdev->domain = target;
-    }
-
-    return ret;
+    return 0;
 }
 
 static int cf_check intel_iommu_assign_device(
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -197,8 +197,12 @@ struct root_entry {
     do {(root).val |= ((value) & PAGE_MASK_4K);} while(0)
 
 struct context_entry {
-    u64 lo;
-    u64 hi;
+    union {
+        struct {
+            uint64_t lo, hi;
+        };
+        __uint128_t full;
+    };
 };
 #define ROOT_ENTRY_NR (PAGE_SIZE_4K/sizeof(struct root_entry))
 #define context_present(c) ((c).lo & 1)
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -407,7 +407,8 @@ void __init platform_quirks_init(void)
  */
 
 static int __must_check map_me_phantom_function(struct domain *domain,
-                                                u32 dev, int map)
+                                                unsigned int dev,
+                                                unsigned int mode)
 {
     struct acpi_drhd_unit *drhd;
     struct pci_dev *pdev;
@@ -418,9 +419,9 @@ static int __must_check map_me_phantom_f
     drhd = acpi_find_matched_drhd_unit(pdev);
 
     /* map or unmap ME phantom function */
-    if ( map )
+    if ( !(mode & UNMAP_ME_PHANTOM_FUNC) )
         rc = domain_context_mapping_one(domain, drhd->iommu, 0,
-                                        PCI_DEVFN(dev, 7), NULL);
+                                        PCI_DEVFN(dev, 7), NULL, mode);
     else
         rc = domain_context_unmap_one(domain, drhd->iommu, 0,
                                       PCI_DEVFN(dev, 7));
@@ -428,7 +429,8 @@ static int __must_check map_me_phantom_f
     return rc;
 }
 
-int me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map)
+int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn,
+                  unsigned int mode)
 {
     u32 id;
     int rc = 0;
@@ -452,7 +454,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x423b8086:
             case 0x423c8086:
             case 0x423d8086:
-                rc = map_me_phantom_function(domain, 3, map);
+                rc = map_me_phantom_function(domain, 3, mode);
                 break;
             default:
                 break;
@@ -478,7 +480,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x42388086:        /* Puma Peak */
             case 0x422b8086:
             case 0x422c8086:
-                rc = map_me_phantom_function(domain, 22, map);
+                rc = map_me_phantom_function(domain, 22, mode);
                 break;
             default:
                 break;
--- a/xen/drivers/passthrough/vtd/vtd.h
+++ b/xen/drivers/passthrough/vtd/vtd.h
@@ -22,8 +22,14 @@
 
 #include <xen/iommu.h>
 
-#define MAP_ME_PHANTOM_FUNC      1
-#define UNMAP_ME_PHANTOM_FUNC    0
+/*
+ * Values for domain_context_mapping_one()'s and me_wifi_quirk()'s "mode"
+ * parameters.
+ */
+#define MAP_WITH_RMRR         (1u << 0)
+#define MAP_OWNER_DYING       (1u << 1)
+#define MAP_SINGLE_DEVICE     (1u << 2)
+#define UNMAP_ME_PHANTOM_FUNC (1u << 3)
 
 /* Allow for both IOAPIC and IOSAPIC. */
 #define IO_xAPIC_route_entry IO_APIC_route_entry
From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: re-assign devices directly

Since it is unspecified how/when the memory ranges of devices with
unity maps may get accessed, such devices may not be left disconnected
from their unity mappings (as long as it's not certain that the device
has been fully quiesced). Hence, rather than tearing down the old root
page table pointer and then establishing the new one, re-assignment
needs to be done in a single step.

This is CVE-2022-26360 / part of XSA-400.

Reported-by: Roger Pau Monné <roger.pau@citrix.com>

Similarly, the scratch-page mode of quarantining relies on page tables
remaining continuously wired up.

To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any unity map
ranges.  The main difference is when it comes to updating DTEs, which need
to be atomic when there are unity mappings. Yet atomicity can only be
achieved with CMPXCHG16B, the availability of which we can't take for
granted (a minimal sketch of the non-CMPXCHG16B fallback ordering
follows the revision notes below).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
v6: Also include devices with phantom functions in the not-single-device
    check. Add comment ahead of amd_iommu_set_root_page_table().
v5: Refuse re-assignment of devices with requestor ID different from BDF
    and with unity map. Convert asserting of CX16 success to error
    return. Introduce and use any_pdev_behind_iommu(). Refine warning
    message text and use PCI_SBDF3() there. Use correct domain to decide
    whether to warn.
v4: Refine when to attempt atomic DTE update. Add warning.
v3: Replace smp_wmb() by comment. Drop a redundant assignment.
v2: New.
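
For illustration only (not part of the patch): a minimal sketch, with
hypothetical names, of the best-effort fallback used when a 16-byte CAS
is not available.  The 64-bit word holding the domain ID is written
last, so that during the short inconsistent window the IOMMU can at
worst tag TLB entries for the new page tables with the old domain ID,
which is harmless when no other active device still uses that ID.

    /* Hypothetical sketch, not Xen code; names are made up. */
    #include <stdint.h>

    struct dte_words {
        uint64_t w0;   /* root pointer, permissions, valid bit, ... */
        uint64_t w1;   /* domain ID lives here, among other fields  */
    };

    static void dte_update_ordered(struct dte_words *dte,
                                   const struct dte_words *val)
    {
        __atomic_store_n(&dte->w0, val->w0, __ATOMIC_RELAXED);
        /* Release ordering keeps the store above visible first. */
        __atomic_store_n(&dte->w1, val->w1, __ATOMIC_RELEASE);
    }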

--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -263,9 +263,13 @@ void amd_iommu_set_intremap_table(struct
                                   const void *ptr,
                                   const struct amd_iommu *iommu,
                                   bool valid);
-void amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
-				   uint64_t root_ptr, uint16_t domain_id,
-				   uint8_t paging_mode, bool valid);
+#define SET_ROOT_VALID          (1u << 0)
+#define SET_ROOT_WITH_UNITY_MAP (1u << 1)
+int __must_check amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
+                                               uint64_t root_ptr,
+                                               uint16_t domain_id,
+                                               uint8_t paging_mode,
+                                               unsigned int flags);
 void iommu_dte_add_device_entry(struct amd_iommu_dte *dte,
                                 const struct ivrs_mappings *ivrs_dev);
 
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -114,10 +114,69 @@ static unsigned int set_iommu_ptes_prese
     return flush_flags;
 }
 
-void amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
-                                   uint64_t root_ptr, uint16_t domain_id,
-                                   uint8_t paging_mode, bool valid)
+/*
+ * This function returns
+ * - -errno for errors,
+ * - 0 for a successful update, atomic when necessary
+ * - 1 for a successful but non-atomic update, which may need to be warned
+ *   about by the caller.
+ */
+int amd_iommu_set_root_page_table(struct amd_iommu_dte *dte,
+                                  uint64_t root_ptr, uint16_t domain_id,
+                                  uint8_t paging_mode, unsigned int flags)
 {
+    bool valid = flags & SET_ROOT_VALID;
+
+    if ( dte->v && dte->tv &&
+         (cpu_has_cx16 || (flags & SET_ROOT_WITH_UNITY_MAP)) )
+    {
+        union {
+            struct amd_iommu_dte dte;
+            uint64_t raw64[4];
+            __uint128_t raw128[2];
+        } ldte = { .dte = *dte };
+        __uint128_t old = ldte.raw128[0];
+        int ret = 0;
+
+        ldte.dte.domain_id = domain_id;
+        ldte.dte.pt_root = paddr_to_pfn(root_ptr);
+        ldte.dte.iw = true;
+        ldte.dte.ir = true;
+        ldte.dte.paging_mode = paging_mode;
+        ldte.dte.v = valid;
+
+        if ( cpu_has_cx16 )
+        {
+            __uint128_t res = cmpxchg16b(dte, &old, &ldte.raw128[0]);
+
+            /*
+             * Hardware does not update the DTE behind our backs, so the
+             * return value should match "old".
+             */
+            if ( res != old )
+            {
+                printk(XENLOG_ERR
+                       "Dom%d: unexpected DTE %016lx_%016lx (expected %016lx_%016lx)\n",
+                       domain_id,
+                       (uint64_t)(res >> 64), (uint64_t)res,
+                       (uint64_t)(old >> 64), (uint64_t)old);
+                ret = -EILSEQ;
+            }
+        }
+        else /* Best effort, updating domain_id last. */
+        {
+            uint64_t *ptr = (void *)dte;
+
+            write_atomic(ptr + 0, ldte.raw64[0]);
+            /* No barrier should be needed between these two. */
+            write_atomic(ptr + 1, ldte.raw64[1]);
+
+            ret = 1;
+        }
+
+        return ret;
+    }
+
     if ( valid || dte->v )
     {
         dte->tv = false;
@@ -132,6 +191,8 @@ void amd_iommu_set_root_page_table(struc
     smp_wmb();
     dte->tv = true;
     dte->v = valid;
+
+    return 0;
 }
 
 void amd_iommu_set_intremap_table(
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -96,13 +96,32 @@ static int __must_check allocate_domain_
     return rc;
 }
 
+static bool any_pdev_behind_iommu(const struct domain *d,
+                                  const struct pci_dev *exclude,
+                                  const struct amd_iommu *iommu)
+{
+    const struct pci_dev *pdev;
+
+    for_each_pdev ( d, pdev )
+    {
+        if ( pdev == exclude )
+            continue;
+
+        if ( find_iommu_for_device(pdev->seg, pdev->sbdf.bdf) == iommu )
+            return true;
+    }
+
+    return false;
+}
+
 static int __must_check amd_iommu_setup_domain_device(
     struct domain *domain, struct amd_iommu *iommu,
     uint8_t devfn, struct pci_dev *pdev)
 {
     struct amd_iommu_dte *table, *dte;
     unsigned long flags;
-    int req_id, valid = 1, rc;
+    unsigned int req_id, sr_flags;
+    int rc;
     u8 bus = pdev->bus;
     struct domain_iommu *hd = dom_iommu(domain);
     const struct ivrs_mappings *ivrs_dev;
@@ -116,8 +135,11 @@ static int __must_check amd_iommu_setup_
     if ( rc )
         return rc;
 
-    if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
-        valid = 0;
+    req_id = get_dma_requestor_id(iommu->seg, pdev->sbdf.bdf);
+    ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
+    sr_flags = (iommu_hwdom_passthrough && is_hardware_domain(domain)
+                ? 0 : SET_ROOT_VALID)
+               | (ivrs_dev->unity_map ? SET_ROOT_WITH_UNITY_MAP : 0);
 
     /* get device-table entry */
     req_id = get_dma_requestor_id(iommu->seg, PCI_BDF2(bus, devfn));
@@ -130,9 +152,15 @@ static int __must_check amd_iommu_setup_
     if ( !dte->v || !dte->tv )
     {
         /* bind DTE to domain page-tables */
-        amd_iommu_set_root_page_table(
-            dte, page_to_maddr(hd->arch.amd.root_table),
-            domain->domain_id, hd->arch.amd.paging_mode, valid);
+        rc = amd_iommu_set_root_page_table(
+                 dte, page_to_maddr(hd->arch.amd.root_table),
+                 domain->domain_id, hd->arch.amd.paging_mode, sr_flags);
+        if ( rc )
+        {
+            ASSERT(rc < 0);
+            spin_unlock_irqrestore(&iommu->lock, flags);
+            return rc;
+        }
 
         /* Undo what amd_iommu_disable_domain_device() may have done. */
         if ( dte->it_root )
@@ -152,17 +180,76 @@ static int __must_check amd_iommu_setup_
         spin_unlock_irqrestore(&iommu->lock, flags);
 
         amd_iommu_flush_device(iommu, req_id);
+    }
+    else if ( dte->pt_root != mfn_x(page_to_mfn(hd->arch.amd.root_table)) )
+    {
+        /*
+         * Strictly speaking if the device is the only one with this requestor
+         * ID, it could be allowed to be re-assigned regardless of unity map
+         * presence.  But let's deal with that case only if it is actually
+         * found in the wild.
+         */
+        if ( req_id != PCI_BDF2(bus, devfn) &&
+             (sr_flags & SET_ROOT_WITH_UNITY_MAP) )
+            rc = -EOPNOTSUPP;
+        else
+            rc = amd_iommu_set_root_page_table(
+                     dte, page_to_maddr(hd->arch.amd.root_table),
+                     domain->domain_id, hd->arch.amd.paging_mode, sr_flags);
+        if ( rc < 0 )
+        {
+            spin_unlock_irqrestore(&iommu->lock, flags);
+            return rc;
+        }
+        if ( rc &&
+             domain != pdev->domain &&
+             /*
+              * By non-atomically updating the DTE's domain ID field last,
+              * during a short window in time TLB entries with the old domain
+              * ID but the new page tables may have been inserted.  This could
+              * affect I/O of other devices using this same (old) domain ID.
+              * Such updating therefore is not a problem if this was the only
+              * device associated with the old domain ID.  Diverting I/O of any
+              * of a dying domain's devices to the quarantine page tables is
+              * intended anyway.
+              */
+             !pdev->domain->is_dying &&
+             (any_pdev_behind_iommu(pdev->domain, pdev, iommu) ||
+              pdev->phantom_stride) )
+            AMD_IOMMU_WARN(" %pp: reassignment may cause %pd data corruption\n",
+                           &PCI_SBDF3(pdev->seg, bus, devfn), pdev->domain);
+
+        /*
+         * Check remaining settings are still in place from an earlier call
+         * here. They're all independent of the domain, so should not have
+         * changed.
+         */
+        if ( dte->it_root )
+            ASSERT(dte->int_ctl == IOMMU_DEV_TABLE_INT_CONTROL_TRANSLATED);
+        ASSERT(dte->iv == iommu_intremap);
+        ASSERT(dte->ex == ivrs_dev->dte_allow_exclusion);
+        ASSERT(dte->sys_mgt == MASK_EXTR(ivrs_dev->device_flags,
+                                         ACPI_IVHD_SYSTEM_MGMT));
 
-        AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
-                        "root table = %#"PRIx64", "
-                        "domain = %d, paging mode = %d\n",
-                        req_id, pdev->type,
-                        page_to_maddr(hd->arch.amd.root_table),
-                        domain->domain_id, hd->arch.amd.paging_mode);
+        if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
+             !ivrs_dev->block_ats &&
+             iommu_has_cap(iommu, PCI_CAP_IOTLB_SHIFT) )
+            ASSERT(dte->i == ats_enabled);
+
+        spin_unlock_irqrestore(&iommu->lock, flags);
+
+        amd_iommu_flush_device(iommu, req_id);
     }
     else
         spin_unlock_irqrestore(&iommu->lock, flags);
 
+    AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
+                    "root table = %#"PRIx64", "
+                    "domain = %d, paging mode = %d\n",
+                    req_id, pdev->type,
+                    page_to_maddr(hd->arch.amd.root_table),
+                    domain->domain_id, hd->arch.amd.paging_mode);
+
     ASSERT(pcidevs_locked());
 
     if ( pci_ats_device(iommu->seg, bus, pdev->devfn) &&
@@ -367,7 +454,20 @@ static int cf_check reassign_device(
         return -ENODEV;
     }
 
-    amd_iommu_disable_domain_device(source, iommu, devfn, pdev);
+    if ( !QUARANTINE_SKIP(target) )
+    {
+        rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
+        if ( rc )
+            return rc;
+    }
+    else
+        amd_iommu_disable_domain_device(source, iommu, devfn, pdev);
+
+    if ( devfn == pdev->devfn && pdev->domain != target )
+    {
+        list_move(&pdev->domain_list, &target->pdev_list);
+        pdev->domain = target;
+    }
 
     /*
      * If the device belongs to the hardware domain, and it has a unity mapping,
@@ -383,25 +483,9 @@ static int cf_check reassign_device(
             return rc;
     }
 
-    if ( devfn == pdev->devfn && pdev->domain != dom_io )
-    {
-        list_move(&pdev->domain_list, &dom_io->pdev_list);
-        pdev->domain = dom_io;
-    }
-
-    rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
-    if ( rc )
-        return rc;
-
     AMD_IOMMU_DEBUG("Re-assign %pp from dom%d to dom%d\n",
                     &pdev->sbdf, source->domain_id, target->domain_id);
 
-    if ( devfn == pdev->devfn && pdev->domain != target )
-    {
-        list_move(&pdev->domain_list, &target->pdev_list);
-        pdev->domain = target;
-    }
-
     return 0;
 }
 
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: prepare for per-device quarantine page tables (part I)

Arrange for the domain ID and page table root to be passed around, the
latter in particular to domain_pgd_maddr(), so that taking it from the
per-domain fields can be overridden (a minimal sketch of this override
pattern follows the revision notes below).

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
v7: Re-base.
v5: Also pass around domain ID.
v4: Drop stray insertion of a blank line in the middle of a comment.
    Re-base over new earlier patch and other changes earlier in the
    series.
v3: Re-base over changes earlier in the series.
v2: New.
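
For illustration only (not part of the patch): a minimal sketch, with
hypothetical names, of the override pattern this prepares for.  Callers
hand in a page table root explicitly, and only when none is supplied
does the callee fall back to the per-domain field; later in the series
this lets a per-device (quarantine) root be used for dom_io.

    /* Hypothetical sketch, not Xen code; names are made up. */
    #include <stdint.h>

    typedef uint64_t paddr_t;

    struct dom_iommu_ctx {
        paddr_t pgd_maddr;            /* per-domain page table root */
    };

    static paddr_t effective_pgd_maddr(const struct dom_iommu_ctx *hd,
                                       paddr_t override_maddr)
    {
        /* A caller-supplied root (e.g. per-device) takes precedence. */
        return override_maddr ? override_maddr : hd->pgd_maddr;
    }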

--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -82,9 +82,10 @@ void *map_vtd_domain_page(u64 maddr);
 void unmap_vtd_domain_page(const void *va);
 int domain_context_mapping_one(struct domain *domain, struct vtd_iommu *iommu,
                                uint8_t bus, uint8_t devfn,
-                               const struct pci_dev *pdev, unsigned int mode);
+                               const struct pci_dev *pdev, domid_t domid,
+                               paddr_t pgd_maddr, unsigned int mode);
 int domain_context_unmap_one(struct domain *domain, struct vtd_iommu *iommu,
-                             u8 bus, u8 devfn);
+                             uint8_t bus, uint8_t devfn, domid_t domid);
 int cf_check intel_iommu_get_reserved_device_memory(
     iommu_grdm_t *func, void *ctxt);
 
@@ -105,7 +106,8 @@ void platform_quirks_init(void);
 void vtd_ops_preamble_quirk(struct vtd_iommu *iommu);
 void vtd_ops_postamble_quirk(struct vtd_iommu *iommu);
 int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus,
-                               uint8_t devfn, unsigned int mode);
+                               uint8_t devfn, domid_t domid, paddr_t pgd_maddr,
+                               unsigned int mode);
 void pci_vtd_quirk(const struct pci_dev *);
 void quirk_iommu_caps(struct vtd_iommu *iommu);
 
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -43,7 +43,7 @@
 #include "../ats.h"
 
 /* dom_io is used as a sentinel for quarantined devices */
-#define QUARANTINE_SKIP(d) ((d) == dom_io && !dom_iommu(d)->arch.vtd.pgd_maddr)
+#define QUARANTINE_SKIP(d, pgd_maddr) ((d) == dom_io && !(pgd_maddr))
 
 /* Possible unfiltered LAPIC/MSI messages from untrusted sources? */
 bool __read_mostly untrusted_msi;
@@ -346,15 +346,17 @@ static u64 addr_to_dma_page_maddr(struct
     return pte_maddr;
 }
 
-static uint64_t domain_pgd_maddr(struct domain *d, unsigned int nr_pt_levels)
+static paddr_t domain_pgd_maddr(struct domain *d, paddr_t pgd_maddr,
+                                unsigned int nr_pt_levels)
 {
     struct domain_iommu *hd = dom_iommu(d);
-    uint64_t pgd_maddr;
     unsigned int agaw;
 
     ASSERT(spin_is_locked(&hd->arch.mapping_lock));
 
-    if ( iommu_use_hap_pt(d) )
+    if ( pgd_maddr )
+        /* nothing */;
+    else if ( iommu_use_hap_pt(d) )
     {
         pagetable_t pgt = p2m_get_pagetable(p2m_get_hostp2m(d));
 
@@ -1379,18 +1381,18 @@ int domain_context_mapping_one(
     struct domain *domain,
     struct vtd_iommu *iommu,
     uint8_t bus, uint8_t devfn, const struct pci_dev *pdev,
-    unsigned int mode)
+    domid_t domid, paddr_t pgd_maddr, unsigned int mode)
 {
     struct domain_iommu *hd = dom_iommu(domain);
     struct context_entry *context, *context_entries, lctxt;
     __uint128_t old;
-    u64 maddr, pgd_maddr;
+    uint64_t maddr;
     uint16_t seg = iommu->drhd->segment, prev_did = 0;
     struct domain *prev_dom = NULL;
     int rc, ret;
     bool_t flush_dev_iotlb;
 
-    if ( QUARANTINE_SKIP(domain) )
+    if ( QUARANTINE_SKIP(domain, pgd_maddr) )
         return 0;
 
     ASSERT(pcidevs_locked());
@@ -1427,10 +1429,12 @@ int domain_context_mapping_one(
     }
     else
     {
+        paddr_t root;
+
         spin_lock(&hd->arch.mapping_lock);
 
-        pgd_maddr = domain_pgd_maddr(domain, iommu->nr_pt_levels);
-        if ( !pgd_maddr )
+        root = domain_pgd_maddr(domain, pgd_maddr, iommu->nr_pt_levels);
+        if ( !root )
         {
             spin_unlock(&hd->arch.mapping_lock);
             spin_unlock(&iommu->lock);
@@ -1440,7 +1444,7 @@ int domain_context_mapping_one(
             return -ENOMEM;
         }
 
-        context_set_address_root(lctxt, pgd_maddr);
+        context_set_address_root(lctxt, root);
         if ( ats_enabled && ecap_dev_iotlb(iommu->ecap) )
             context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB);
         else
@@ -1557,15 +1561,21 @@ int domain_context_mapping_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !seg && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, mode);
+        rc = me_wifi_quirk(domain, bus, devfn, domid, pgd_maddr, mode);
 
     if ( rc )
     {
         if ( !prev_dom )
-            ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+            ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+                                           domain->domain_id);
         else if ( prev_dom != domain ) /* Avoid infinite recursion. */
+        {
+            hd = dom_iommu(prev_dom);
             ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
+                                             domain->domain_id,
+                                             hd->arch.vtd.pgd_maddr,
                                              mode & MAP_WITH_RMRR) < 0;
+        }
         else
             ret = 1;
 
@@ -1587,6 +1597,7 @@ static int domain_context_mapping(struct
 {
     const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
     const struct acpi_rmrr_unit *rmrr;
+    paddr_t pgd_maddr = dom_iommu(domain)->arch.vtd.pgd_maddr;
     int ret = 0;
     unsigned int i, mode = 0;
     uint16_t seg = pdev->seg, bdf;
@@ -1649,7 +1660,8 @@ static int domain_context_mapping(struct
             printk(VTDPREFIX "%pd:PCIe: map %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, mode);
+                                         pdev, domain->domain_id, pgd_maddr,
+                                         mode);
         if ( ret > 0 )
             ret = 0;
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
@@ -1666,7 +1678,8 @@ static int domain_context_mapping(struct
                    domain, &PCI_SBDF3(seg, bus, devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, mode);
+                                         pdev, domain->domain_id, pgd_maddr,
+                                         mode);
         if ( ret < 0 )
             break;
         prev_present = ret;
@@ -1694,7 +1707,8 @@ static int domain_context_mapping(struct
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                             NULL, mode);
+                                             NULL, domain->domain_id, pgd_maddr,
+                                             mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1709,7 +1723,8 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL, mode);
+                                             NULL, domain->domain_id, pgd_maddr,
+                                             mode);
 
         if ( ret )
         {
@@ -1737,14 +1752,14 @@ static int domain_context_mapping(struct
 int domain_context_unmap_one(
     struct domain *domain,
     struct vtd_iommu *iommu,
-    u8 bus, u8 devfn)
+    uint8_t bus, uint8_t devfn, domid_t domid)
 {
     struct context_entry *context, *context_entries;
     u64 maddr;
     int iommu_domid, rc, ret;
     bool_t flush_dev_iotlb;
 
-    if ( QUARANTINE_SKIP(domain) )
+    if ( QUARANTINE_SKIP(domain, dom_iommu(domain)->arch.vtd.pgd_maddr) )
         return 0;
 
     ASSERT(pcidevs_locked());
@@ -1798,7 +1813,7 @@ int domain_context_unmap_one(
     unmap_vtd_domain_page(context_entries);
 
     if ( !iommu->drhd->segment && !rc )
-        rc = me_wifi_quirk(domain, bus, devfn, UNMAP_ME_PHANTOM_FUNC);
+        rc = me_wifi_quirk(domain, bus, devfn, domid, 0, UNMAP_ME_PHANTOM_FUNC);
 
     if ( rc && !is_hardware_domain(domain) && domain != dom_io )
     {
@@ -1845,7 +1860,8 @@ static int domain_context_unmap(struct d
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCIe: unmap %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+                                       domain->domain_id);
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             disable_ats_device(pdev);
 
@@ -1858,7 +1874,8 @@ static int domain_context_unmap(struct d
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCI: unmap %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
-        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+        ret = domain_context_unmap_one(domain, iommu, bus, devfn,
+                                       domain->domain_id);
         if ( ret )
             break;
 
@@ -1884,12 +1901,15 @@ static int domain_context_unmap(struct d
         /* PCIe to PCI/PCIx bridge */
         if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
         {
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);
+            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                           domain->domain_id);
             if ( !ret )
-                ret = domain_context_unmap_one(domain, iommu, secbus, 0);
+                ret = domain_context_unmap_one(domain, iommu, secbus, 0,
+                                               domain->domain_id);
         }
         else /* Legacy PCI bridge */
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn);
+            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                           domain->domain_id);
 
         break;
 
@@ -1899,7 +1919,8 @@ static int domain_context_unmap(struct d
         return -EINVAL;
     }
 
-    if ( !ret && !QUARANTINE_SKIP(domain) && pdev->devfn == devfn )
+    if ( !ret && pdev->devfn == devfn &&
+         !QUARANTINE_SKIP(domain, dom_iommu(domain)->arch.vtd.pgd_maddr) )
         check_cleanup_domid_map(domain, pdev, iommu);
 
     return ret;
@@ -2506,7 +2527,7 @@ static int cf_check reassign_device_owne
 {
     int ret;
 
-    if ( !QUARANTINE_SKIP(target) )
+    if ( !QUARANTINE_SKIP(target, dom_iommu(target)->arch.vtd.pgd_maddr) )
     {
         if ( !has_arch_pdevs(target) )
             vmx_pi_hooks_assign(target);
@@ -2522,7 +2543,8 @@ static int cf_check reassign_device_owne
 
         ret = domain_context_mapping(target, devfn, pdev);
 
-        if ( !ret && !QUARANTINE_SKIP(source) && pdev->devfn == devfn )
+        if ( !ret && pdev->devfn == devfn &&
+             !QUARANTINE_SKIP(source, dom_iommu(source)->arch.vtd.pgd_maddr) )
         {
             const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
 
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -408,6 +408,8 @@ void __init platform_quirks_init(void)
 
 static int __must_check map_me_phantom_function(struct domain *domain,
                                                 unsigned int dev,
+                                                domid_t domid,
+                                                paddr_t pgd_maddr,
                                                 unsigned int mode)
 {
     struct acpi_drhd_unit *drhd;
@@ -421,16 +423,17 @@ static int __must_check map_me_phantom_f
     /* map or unmap ME phantom function */
     if ( !(mode & UNMAP_ME_PHANTOM_FUNC) )
         rc = domain_context_mapping_one(domain, drhd->iommu, 0,
-                                        PCI_DEVFN(dev, 7), NULL, mode);
+                                        PCI_DEVFN(dev, 7), NULL,
+                                        domid, pgd_maddr, mode);
     else
         rc = domain_context_unmap_one(domain, drhd->iommu, 0,
-                                      PCI_DEVFN(dev, 7));
+                                      PCI_DEVFN(dev, 7), domid);
 
     return rc;
 }
 
 int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn,
-                  unsigned int mode)
+                  domid_t domid, paddr_t pgd_maddr, unsigned int mode)
 {
     u32 id;
     int rc = 0;
@@ -454,7 +457,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x423b8086:
             case 0x423c8086:
             case 0x423d8086:
-                rc = map_me_phantom_function(domain, 3, mode);
+                rc = map_me_phantom_function(domain, 3, domid, pgd_maddr, mode);
                 break;
             default:
                 break;
@@ -480,7 +483,7 @@ int me_wifi_quirk(struct domain *domain,
             case 0x42388086:        /* Puma Peak */
             case 0x422b8086:
             case 0x422c8086:
-                rc = map_me_phantom_function(domain, 22, mode);
+                rc = map_me_phantom_function(domain, 22, domid, pgd_maddr, mode);
                 break;
             default:
                 break;
From: Jan Beulich <jbeulich@suse.com>
Subject: VT-d: prepare for per-device quarantine page tables (part II)

Replace the passing of struct domain * with domid_t, in preparation for
per-device quarantine page tables, which will also require per-device
pseudo domain IDs; those aren't going to be associated with any struct
domain instances.

No functional change intended (except for slightly adjusted log message
text).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
v6: context_set_domain_id()'s ASSERT() change moved to a later patch.
v5: New.

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -79,28 +79,28 @@ static domid_t convert_domid(const struc
     return !cap_caching_mode(iommu->cap) ? domid : ~domid;
 }
 
-static int domain_iommu_domid(const struct domain *d,
-                              const struct vtd_iommu *iommu)
+static int get_iommu_did(domid_t domid, const struct vtd_iommu *iommu,
+                         bool warn)
 {
     unsigned int nr_dom, i;
 
     if ( !domid_mapping(iommu) )
-        return convert_domid(iommu, d->domain_id);
+        return convert_domid(iommu, domid);
 
     nr_dom = cap_ndoms(iommu->cap);
     i = find_first_bit(iommu->domid_bitmap, nr_dom);
     while ( i < nr_dom )
     {
-        if ( iommu->domid_map[i] == d->domain_id )
+        if ( iommu->domid_map[i] == domid )
             return i;
 
         i = find_next_bit(iommu->domid_bitmap, nr_dom, i + 1);
     }
 
-    if ( !d->is_dying )
+    if ( warn )
         dprintk(XENLOG_ERR VTDPREFIX,
-                "Cannot get valid iommu %u domid: %pd\n",
-                iommu->index, d);
+                "No valid iommu %u domid for Dom%d\n",
+                iommu->index, domid);
 
     return -1;
 }
@@ -108,8 +108,7 @@ static int domain_iommu_domid(const stru
 #define DID_FIELD_WIDTH 16
 #define DID_HIGH_OFFSET 8
 static int context_set_domain_id(struct context_entry *context,
-                                 const struct domain *d,
-                                 struct vtd_iommu *iommu)
+                                 domid_t domid, struct vtd_iommu *iommu)
 {
     unsigned int i;
 
@@ -120,7 +119,7 @@ static int context_set_domain_id(struct
         unsigned int nr_dom = cap_ndoms(iommu->cap);
 
         i = find_first_bit(iommu->domid_bitmap, nr_dom);
-        while ( i < nr_dom && iommu->domid_map[i] != d->domain_id )
+        while ( i < nr_dom && iommu->domid_map[i] != domid )
             i = find_next_bit(iommu->domid_bitmap, nr_dom, i + 1);
 
         if ( i >= nr_dom )
@@ -131,26 +130,26 @@ static int context_set_domain_id(struct
                 dprintk(XENLOG_ERR VTDPREFIX, "IOMMU: no free domain id\n");
                 return -EBUSY;
             }
-            iommu->domid_map[i] = d->domain_id;
+            iommu->domid_map[i] = domid;
             set_bit(i, iommu->domid_bitmap);
         }
     }
     else
-        i = convert_domid(iommu, d->domain_id);
+        i = convert_domid(iommu, domid);
 
     context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
     context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
     return 0;
 }
 
-static void cleanup_domid_map(struct domain *domain, struct vtd_iommu *iommu)
+static void cleanup_domid_map(domid_t domid, struct vtd_iommu *iommu)
 {
     int iommu_domid;
 
     if ( !domid_mapping(iommu) )
         return;
 
-    iommu_domid = domain_iommu_domid(domain, iommu);
+    iommu_domid = get_iommu_did(domid, iommu, false);
 
     if ( iommu_domid >= 0 )
     {
@@ -190,7 +189,7 @@ static bool any_pdev_behind_iommu(const
  * If no other devices under the same iommu owned by this domain,
  * clear iommu in iommu_bitmap and clear domain_id in domid_bitmap.
  */
-static void check_cleanup_domid_map(struct domain *d,
+static void check_cleanup_domid_map(const struct domain *d,
                                     const struct pci_dev *exclude,
                                     struct vtd_iommu *iommu)
 {
@@ -206,7 +205,7 @@ static void check_cleanup_domid_map(stru
     if ( !found )
     {
         clear_bit(iommu->index, dom_iommu(d)->arch.vtd.iommu_bitmap);
-        cleanup_domid_map(d, iommu);
+        cleanup_domid_map(d->domain_id, iommu);
     }
 }
 
@@ -667,7 +666,7 @@ static int __must_check iommu_flush_iotl
             continue;
 
         flush_dev_iotlb = !!find_ats_dev_drhd(iommu);
-        iommu_domid= domain_iommu_domid(d, iommu);
+        iommu_domid = get_iommu_did(d->domain_id, iommu, !d->is_dying);
         if ( iommu_domid == -1 )
             continue;
 
@@ -1453,7 +1452,7 @@ int domain_context_mapping_one(
         spin_unlock(&hd->arch.mapping_lock);
     }
 
-    rc = context_set_domain_id(&lctxt, domain, iommu);
+    rc = context_set_domain_id(&lctxt, domid, iommu);
     if ( rc )
     {
     unlock:
@@ -1780,7 +1779,7 @@ int domain_context_unmap_one(
     context_clear_entry(*context);
     iommu_sync_cache(context, sizeof(struct context_entry));
 
-    iommu_domid= domain_iommu_domid(domain, iommu);
+    iommu_domid = get_iommu_did(domid, iommu, !domain->is_dying);
     if ( iommu_domid == -1 )
     {
         spin_unlock(&iommu->lock);
@@ -1948,7 +1947,7 @@ static void cf_check iommu_domain_teardo
     ASSERT(!hd->arch.vtd.pgd_maddr);
 
     for_each_drhd_unit ( drhd )
-        cleanup_domid_map(d, drhd->iommu);
+        cleanup_domid_map(d->domain_id, drhd->iommu);
 
     XFREE(hd->arch.vtd.iommu_bitmap);
 }
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: maintain a per-device pseudo domain ID

In order to subsequently enable per-device quarantine page tables, we'll
need domain-ID-like identifiers to be inserted in the respective device
(AMD) or context (Intel) table entries alongside the per-device page
table root addresses.

Take advantage of the fact that "real" domain IDs occupy only half of
the value range coverable by domid_t, leaving the other half free for
these per-device pseudo IDs.
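
For illustration only (this sketch is not part of the patches), the
resulting ID layout under the assumption that domid_t is 16 bits wide
and DOMID_MASK is 0x7fff: pseudo IDs live in the upper half of the
range and therefore can never collide with a "real" domain ID.  The
names below are made up; the actual allocator is iommu_alloc_domid()
further down.

    /*
     * Illustrative sketch only -- not Xen code.  Shows how a per-device
     * pseudo domain ID relates to its slot in the allocation bitmap,
     * assuming DOMID_MASK == 0x7fff, i.e. "real" IDs use 0x0000-0x7fff.
     */
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    #define DOMID_MASK 0x7fffu

    typedef uint16_t domid_t;

    /* Bitmap slot -> pseudo domain ID (0x8000 - 0xffff). */
    static domid_t pseudo_domid_from_slot(unsigned int slot)
    {
        assert(slot <= DOMID_MASK);
        return slot | (DOMID_MASK + 1);
    }

    /* Pseudo domain ID -> bitmap slot, e.g. for freeing. */
    static unsigned int slot_from_pseudo_domid(domid_t domid)
    {
        assert(domid > DOMID_MASK);
        return domid & DOMID_MASK;
    }

    int main(void)
    {
        unsigned int slot = 5;
        domid_t pseudo = pseudo_domid_from_slot(slot);

        printf("slot %u -> pseudo domid %#x -> slot %u\n",
               slot, pseudo, slot_from_pseudo_domid(pseudo));
        return 0;
    }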

Note that in VT-d's iommu_alloc() I didn't want to introduce new memory
leaks in case of error, but existing ones don't get plugged - that'll be
the subject of a later change.

The VT-d changes are slightly asymmetric, but this way we can avoid
assigning pseudo domain IDs to devices which would never be mapped,
while still avoiding the addition of a new parameter to
domain_context_unmap().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
v7: Re-base.
v6: Also adjust nr_dom range check in iommu_alloc().
v5: New.

--- a/xen/arch/x86/include/asm/iommu.h
+++ b/xen/arch/x86/include/asm/iommu.h
@@ -141,6 +141,10 @@ static inline void iommu_sync_cache(cons
         cache_writeback(addr, size);
 }
 
+unsigned long *iommu_init_domid(domid_t reserve);
+domid_t iommu_alloc_domid(unsigned long *map);
+void iommu_free_domid(domid_t domid, unsigned long *map);
+
 int __must_check iommu_free_pgtables(struct domain *d);
 struct domain_iommu;
 struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd);
--- a/xen/arch/x86/include/asm/pci.h
+++ b/xen/arch/x86/include/asm/pci.h
@@ -13,6 +13,12 @@
 
 struct arch_pci_dev {
     vmask_t used_vectors;
+    /*
+     * These fields are (de)initialized under pcidevs-lock. Other uses of
+     * them don't race (de)initialization and hence don't strictly need any
+     * locking.
+     */
+    domid_t pseudo_domid;
 };
 
 int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
@@ -36,6 +42,6 @@ static always_inline bool is_pci_passthr
     return true;
 }
 
-static inline void arch_pci_init_pdev(struct pci_dev *pdev) {}
+void arch_pci_init_pdev(struct pci_dev *pdev);
 
 #endif /* __X86_PCI_H__ */
--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -96,6 +96,7 @@ struct amd_iommu {
     struct ring_buffer cmd_buffer;
     struct ring_buffer event_log;
     struct ring_buffer ppr_log;
+    unsigned long *domid_map;
 
     int exclusion_enable;
     int exclusion_allow_all;
--- a/xen/drivers/passthrough/amd/iommu_detect.c
+++ b/xen/drivers/passthrough/amd/iommu_detect.c
@@ -223,6 +223,11 @@ int __init amd_iommu_detect_one_acpi(
     if ( rt )
         goto out;
 
+    iommu->domid_map = iommu_init_domid(DOMID_INVALID);
+    rt = -ENOMEM;
+    if ( !iommu->domid_map )
+        goto out;
+
     rt = pci_ro_device(iommu->seg, bus, PCI_DEVFN(dev, func));
     if ( rt )
         printk(XENLOG_ERR "Could not mark config space of %pp read-only (%d)\n",
@@ -233,7 +238,10 @@ int __init amd_iommu_detect_one_acpi(
 
  out:
     if ( rt )
+    {
+        xfree(iommu->domid_map);
         xfree(iommu);
+    }
 
     return rt;
 }
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -539,6 +539,8 @@ static int cf_check amd_iommu_add_device
     struct amd_iommu *iommu;
     u16 bdf;
     struct ivrs_mappings *ivrs_mappings;
+    bool fresh_domid = false;
+    int ret;
 
     if ( !pdev->domain )
         return -EINVAL;
@@ -606,7 +608,22 @@ static int cf_check amd_iommu_add_device
         AMD_IOMMU_WARN("%pd: unity mapping failed for %pp\n",
                        pdev->domain, &pdev->sbdf);
 
-    return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
+    if ( iommu_quarantine && pdev->arch.pseudo_domid == DOMID_INVALID )
+    {
+        pdev->arch.pseudo_domid = iommu_alloc_domid(iommu->domid_map);
+        if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+            return -ENOSPC;
+        fresh_domid = true;
+    }
+
+    ret = amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
+    if ( ret && fresh_domid )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
+    return ret;
 }
 
 static int cf_check amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
@@ -638,6 +655,9 @@ static int cf_check amd_iommu_remove_dev
         AMD_IOMMU_WARN("%pd: unity unmapping failed for %pp\n",
                        pdev->domain, &pdev->sbdf);
 
+    iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
+    pdev->arch.pseudo_domid = DOMID_INVALID;
+
     if ( amd_iommu_perdev_intremap &&
          ivrs_mappings[bdf].dte_requestor_id == bdf &&
          ivrs_mappings[bdf].intremap_table )
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1342,9 +1342,14 @@ static int cf_check _dump_pci_devices(st
 
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
     {
-        printk("%pp - %pd - node %-3d",
-               &pdev->sbdf, pdev->domain,
-               (pdev->node != NUMA_NO_NODE) ? pdev->node : -1);
+        printk("%pp - ", &pdev->sbdf);
+#ifdef CONFIG_X86
+        if ( pdev->domain == dom_io )
+            printk("DomIO:%x", pdev->arch.pseudo_domid);
+        else
+#endif
+            printk("%pd", pdev->domain);
+        printk(" - node %-3d", (pdev->node != NUMA_NO_NODE) ? pdev->node : -1);
         pdev_dump_msi(pdev);
         printk("\n");
     }
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -22,6 +22,7 @@
 #include <xen/sched.h>
 #include <xen/xmalloc.h>
 #include <xen/domain_page.h>
+#include <xen/err.h>
 #include <xen/iocap.h>
 #include <xen/iommu.h>
 #include <xen/numa.h>
@@ -1199,6 +1200,8 @@ int __init iommu_alloc(struct acpi_drhd_
 {
     struct vtd_iommu *iommu;
     unsigned int sagaw, agaw = 0, nr_dom;
+    domid_t reserved_domid = DOMID_INVALID;
+    int rc;
 
     iommu = xzalloc(struct vtd_iommu);
     if ( iommu == NULL )
@@ -1269,7 +1272,7 @@ int __init iommu_alloc(struct acpi_drhd_
 
     nr_dom = cap_ndoms(iommu->cap);
 
-    if ( nr_dom <= DOMID_MASK + cap_caching_mode(iommu->cap) )
+    if ( nr_dom <= DOMID_MASK * 2 + cap_caching_mode(iommu->cap) )
     {
         /* Allocate domain id (bit) maps. */
         iommu->domid_bitmap = xzalloc_array(unsigned long,
@@ -1293,9 +1296,24 @@ int __init iommu_alloc(struct acpi_drhd_
         /* Don't leave dangling NULL pointers. */
         iommu->domid_bitmap = ZERO_BLOCK_PTR;
         iommu->domid_map = ZERO_BLOCK_PTR;
+
+        /*
+         * If Caching mode is set, then invalid translations are tagged
+         * with domain id 0. Hence reserve the ID taking up bit/slot 0.
+         */
+        reserved_domid = convert_domid(iommu, 0) ?: DOMID_INVALID;
     }
 
+    iommu->pseudo_domid_map = iommu_init_domid(reserved_domid);
+    rc = -ENOMEM;
+    if ( !iommu->pseudo_domid_map )
+        goto free;
+
     return 0;
+
+ free:
+    iommu_free(drhd);
+    return rc;
 }
 
 void __init iommu_free(struct acpi_drhd_unit *drhd)
@@ -1318,6 +1336,7 @@ void __init iommu_free(struct acpi_drhd_
 
     xfree(iommu->domid_bitmap);
     xfree(iommu->domid_map);
+    xfree(iommu->pseudo_domid_map);
 
     if ( iommu->msi.irq >= 0 )
         destroy_irq(iommu->msi.irq);
@@ -1588,8 +1607,8 @@ int domain_context_mapping_one(
     return rc ?: pdev && prev_dom;
 }
 
-static int domain_context_unmap(struct domain *d, uint8_t devfn,
-                                struct pci_dev *pdev);
+static const struct acpi_drhd_unit *domain_context_unmap(
+    struct domain *d, uint8_t devfn, struct pci_dev *pdev);
 
 static int domain_context_mapping(struct domain *domain, u8 devfn,
                                   struct pci_dev *pdev)
@@ -1597,6 +1616,7 @@ static int domain_context_mapping(struct
     const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
     const struct acpi_rmrr_unit *rmrr;
     paddr_t pgd_maddr = dom_iommu(domain)->arch.vtd.pgd_maddr;
+    domid_t orig_domid = pdev->arch.pseudo_domid;
     int ret = 0;
     unsigned int i, mode = 0;
     uint16_t seg = pdev->seg, bdf;
@@ -1655,6 +1675,14 @@ static int domain_context_mapping(struct
         if ( !drhd )
             return -ENODEV;
 
+        if ( iommu_quarantine && orig_domid == DOMID_INVALID )
+        {
+            pdev->arch.pseudo_domid =
+                iommu_alloc_domid(drhd->iommu->pseudo_domid_map);
+            if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+                return -ENOSPC;
+        }
+
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCIe: map %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
@@ -1672,6 +1700,14 @@ static int domain_context_mapping(struct
         if ( !drhd )
             return -ENODEV;
 
+        if ( iommu_quarantine && orig_domid == DOMID_INVALID )
+        {
+            pdev->arch.pseudo_domid =
+                iommu_alloc_domid(drhd->iommu->pseudo_domid_map);
+            if ( pdev->arch.pseudo_domid == DOMID_INVALID )
+                return -ENOSPC;
+        }
+
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCI: map %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
@@ -1745,6 +1781,13 @@ static int domain_context_mapping(struct
     if ( !ret && devfn == pdev->devfn )
         pci_vtd_quirk(pdev);
 
+    if ( ret && drhd && orig_domid == DOMID_INVALID )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid,
+                         drhd->iommu->pseudo_domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
     return ret;
 }
 
@@ -1830,8 +1873,10 @@ int domain_context_unmap_one(
     return rc;
 }
 
-static int domain_context_unmap(struct domain *domain, u8 devfn,
-                                struct pci_dev *pdev)
+static const struct acpi_drhd_unit *domain_context_unmap(
+    struct domain *domain,
+    uint8_t devfn,
+    struct pci_dev *pdev)
 {
     const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
     struct vtd_iommu *iommu = drhd ? drhd->iommu : NULL;
@@ -1845,16 +1890,16 @@ static int domain_context_unmap(struct d
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:Hostbridge: skip %pp unmap\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
-        return is_hardware_domain(domain) ? 0 : -EPERM;
+        return ERR_PTR(is_hardware_domain(domain) ? 0 : -EPERM);
 
     case DEV_TYPE_PCIe_BRIDGE:
     case DEV_TYPE_PCIe2PCI_BRIDGE:
     case DEV_TYPE_LEGACY_PCI_BRIDGE:
-        return 0;
+        return ERR_PTR(0);
 
     case DEV_TYPE_PCIe_ENDPOINT:
         if ( !iommu )
-            return -ENODEV;
+            return ERR_PTR(-ENODEV);
 
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCIe: unmap %pp\n",
@@ -1868,7 +1913,7 @@ static int domain_context_unmap(struct d
 
     case DEV_TYPE_PCI:
         if ( !iommu )
-            return -ENODEV;
+            return ERR_PTR(-ENODEV);
 
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCI: unmap %pp\n",
@@ -1915,14 +1960,14 @@ static int domain_context_unmap(struct d
     default:
         dprintk(XENLOG_ERR VTDPREFIX, "%pd:unknown(%u): %pp\n",
                 domain, pdev->type, &PCI_SBDF3(seg, bus, devfn));
-        return -EINVAL;
+        return ERR_PTR(-EINVAL);
     }
 
     if ( !ret && pdev->devfn == devfn &&
          !QUARANTINE_SKIP(domain, dom_iommu(domain)->arch.vtd.pgd_maddr) )
         check_cleanup_domid_map(domain, pdev, iommu);
 
-    return ret;
+    return drhd;
 }
 
 static void cf_check iommu_clear_root_pgtable(struct domain *d)
@@ -2149,16 +2194,17 @@ static int cf_check intel_iommu_enable_d
 
 static int cf_check intel_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
 {
+    const struct acpi_drhd_unit *drhd;
     struct acpi_rmrr_unit *rmrr;
     u16 bdf;
-    int ret, i;
+    unsigned int i;
 
     if ( !pdev->domain )
         return -EINVAL;
 
-    ret = domain_context_unmap(pdev->domain, devfn, pdev);
-    if ( ret )
-        return ret;
+    drhd = domain_context_unmap(pdev->domain, devfn, pdev);
+    if ( IS_ERR(drhd) )
+        return PTR_ERR(drhd);
 
     for_each_rmrr_device ( rmrr, bdf, i )
     {
@@ -2175,6 +2221,13 @@ static int cf_check intel_iommu_remove_d
                                rmrr->end_address, 0);
     }
 
+    if ( drhd )
+    {
+        iommu_free_domid(pdev->arch.pseudo_domid,
+                         drhd->iommu->pseudo_domid_map);
+        pdev->arch.pseudo_domid = DOMID_INVALID;
+    }
+
     return 0;
 }
 
@@ -2552,7 +2605,12 @@ static int cf_check reassign_device_owne
         }
     }
     else
-        ret = domain_context_unmap(source, devfn, pdev);
+    {
+        const struct acpi_drhd_unit *drhd;
+
+        drhd = domain_context_unmap(source, devfn, pdev);
+        ret = IS_ERR(drhd) ? PTR_ERR(drhd) : 0;
+    }
     if ( ret )
     {
         if ( !has_arch_pdevs(target) )
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -503,6 +503,7 @@ struct vtd_iommu {
     } flush;
 
     struct list_head ats_devices;
+    unsigned long *pseudo_domid_map; /* "pseudo" domain id bitmap */
     unsigned long *domid_bitmap;  /* domain id bitmap */
     domid_t *domid_map;           /* domain id mapping array */
     uint32_t version;
--- a/xen/drivers/passthrough/x86/iommu.c
+++ b/xen/drivers/passthrough/x86/iommu.c
@@ -391,6 +391,66 @@ void __hwdom_init arch_iommu_hwdom_init(
         return;
 }
 
+void arch_pci_init_pdev(struct pci_dev *pdev)
+{
+    pdev->arch.pseudo_domid = DOMID_INVALID;
+}
+
+unsigned long *__init iommu_init_domid(domid_t reserve)
+{
+    unsigned long *map;
+
+    if ( !iommu_quarantine )
+        return ZERO_BLOCK_PTR;
+
+    BUILD_BUG_ON(DOMID_MASK * 2U >= UINT16_MAX);
+
+    map = xzalloc_array(unsigned long, BITS_TO_LONGS(UINT16_MAX - DOMID_MASK));
+    if ( map && reserve != DOMID_INVALID )
+    {
+        ASSERT(reserve > DOMID_MASK);
+        __set_bit(reserve & DOMID_MASK, map);
+    }
+
+    return map;
+}
+
+domid_t iommu_alloc_domid(unsigned long *map)
+{
+    /*
+     * This is used uniformly across all IOMMUs, such that on typical
+     * systems we wouldn't re-use the same ID very quickly (perhaps never).
+     */
+    static unsigned int start;
+    unsigned int idx = find_next_zero_bit(map, UINT16_MAX - DOMID_MASK, start);
+
+    ASSERT(pcidevs_locked());
+
+    if ( idx >= UINT16_MAX - DOMID_MASK )
+        idx = find_first_zero_bit(map, UINT16_MAX - DOMID_MASK);
+    if ( idx >= UINT16_MAX - DOMID_MASK )
+        return DOMID_INVALID;
+
+    __set_bit(idx, map);
+
+    start = idx + 1;
+
+    return idx | (DOMID_MASK + 1);
+}
+
+void iommu_free_domid(domid_t domid, unsigned long *map)
+{
+    ASSERT(pcidevs_locked());
+
+    if ( domid == DOMID_INVALID )
+        return;
+
+    ASSERT(domid > DOMID_MASK);
+
+    if ( !__test_and_clear_bit(domid & DOMID_MASK, map) )
+        BUG();
+}
+
 int iommu_free_pgtables(struct domain *d)
 {
     struct domain_iommu *hd = dom_iommu(d);
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: drop TLB flushes from quarantine_init() hooks

The page tables just created aren't hooked up yet anywhere, so there's
nothing that could be present in any TLB, and hence nothing to flush.
Dropping this flush is, at least on the VT-d side, a prereq to per-
device domain ID use when quarantining devices, as dom_io isn't going
to be assigned a DID anymore: The warning in get_iommu_did() would
trigger.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
v5: This (previously follow-on) patch is now part of this series.

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -656,8 +656,6 @@ int __init cf_check amd_iommu_quarantine
  out:
     spin_unlock(&hd->arch.mapping_lock);
 
-    amd_iommu_flush_all_pages(d);
-
     /* Pages leaked in failure case */
     return level ? -ENOMEM : 0;
 }
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2980,9 +2980,6 @@ static int __init cf_check intel_iommu_q
  out:
     spin_unlock(&hd->arch.mapping_lock);
 
-    if ( !rc )
-        rc = iommu_flush_iotlb_all(d);
-
     /* Pages may be leaked in failure case */
     return rc;
 }
From: Jan Beulich <jbeulich@suse.com>
Subject: AMD/IOMMU: abstract maximum number of page table levels

We will want to use the constant elsewhere.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
---
v7: New.

--- a/xen/drivers/passthrough/amd/iommu-defs.h
+++ b/xen/drivers/passthrough/amd/iommu-defs.h
@@ -106,6 +106,7 @@ struct amd_iommu_dte {
     bool tv:1;
     unsigned int :5;
     unsigned int had:2;
+#define IOMMU_MAX_PT_LEVELS 6
     unsigned int paging_mode:3;
     uint64_t pt_root:40;
     bool ppr:1;
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -337,7 +337,7 @@ int amd_iommu_alloc_root(struct domain *
     return 0;
 }
 
-unsigned int __read_mostly amd_iommu_max_paging_mode = 6;
+unsigned int __read_mostly amd_iommu_max_paging_mode = IOMMU_MAX_PT_LEVELS;
 int __read_mostly amd_iommu_min_paging_mode = 1;
 
 static int cf_check amd_iommu_domain_init(struct domain *d)
From: Jan Beulich <jbeulich@suse.com>
Subject: IOMMU/x86: use per-device page tables for quarantining

Since it is unspecified how/when the memory regions covered by RMRRs /
unity maps may be accessed, devices with such regions may not be left
disconnected from the mappings of these regions (as long as it's not
certain that the device has been fully quiesced).  Hence even the page
tables used when quarantining such devices need to contain mappings of
those regions.  This implies installing page tables in the first place,
even when not in scratch-page quarantining mode.

This is CVE-2022-26361 / part of XSA-400.

While for the purpose here it would be sufficient to have devices with
RMRRs / unity mapped regions use per-device page tables, extend this to
all devices (in scratch-page quarantining mode). This allows the leaf
pages to be mapped r/w, thus covering also memory writes (rather than
just reads) issued by non-quiescent devices.
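
Purely as a conceptual sketch (the real implementations are fill_qpt()
in the AMD hunk below and its VT-d counterpart), the idea is: within a
per-device quarantine page table, leave the entries installed for RMRRs
/ unity maps alone and point every remaining, non-present slot at a
shared, zeroed scratch frame, mapped read/write, so that stray DMA from
a non-quiescent device is absorbed.  The types and names below are
invented for illustration only.

    /*
     * Conceptual sketch only -- not the fill_qpt() from the patches.
     * One level of a quarantine page table: keep already-present entries
     * (RMRR / unity mappings) and point every empty slot at a single
     * shared scratch frame, mapped read/write.
     */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define ENTRIES_PER_TABLE 512

    struct qpt_entry {
        uint64_t maddr;       /* address of next level or scratch frame */
        bool present;
        bool writable;
    };

    void fill_level_with_scratch(struct qpt_entry tbl[ENTRIES_PER_TABLE],
                                 uint64_t scratch_maddr)
    {
        for ( size_t i = 0; i < ENTRIES_PER_TABLE; ++i )
        {
            if ( tbl[i].present )
                continue;              /* keep RMRR / unity-map entries */
            tbl[i].maddr = scratch_maddr;
            tbl[i].writable = true;    /* absorb writes, not just reads */
            tbl[i].present = true;
        }
    }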

Set up quarantine page tables as late as possible, yet early enough to
not encounter failure during de-assign. This means setup generally
happens in assign_device(), while (for now) the one in deassign_device()
is there mainly to be on the safe side.

As to the removal of QUARANTINE_SKIP() from domain_context_unmap_one():
I think this was never really needed there, as the function explicitly
deals with finding a non-present context entry. Leaving it there would
require propagating pgd_maddr into the function (as was done by "VT-d:
prepare for per-device quarantine page tables" for
domain_context_mapping_one()).

In VT-d's DID allocation function don't require the IOMMU lock to be
held anymore: All involved code paths hold pcidevs_lock, so this way we
avoid the need to acquire the IOMMU lock around the new call to
context_set_domain_id().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
v7: Add comment ahead of context_set_domain_id(). Use
    IOMMU_MAX_PT_LEVELS. Re-base.
v6: context_set_domain_id()'s ASSERT() change moved here (from earlier
    patch). Shorten BDF comparison in intel_iommu_quarantine_init(). Re-
    base over changes earlier in the series.
v5: Also use per-device (pseudo) domain ID. Suppress corruption warning
    for DomIO. Re-base over changes to earlier patches.
v4: Re-base over new earlier patch.
v3: No spinning in quarantine_init() hooks when transiently populating
    dom_io's page table root. quarantine_teardown() and arrangements for
    them redone. Re-base over changes earlier in the series. Extend a
    comment. Introduce DEVICE_PGTABLE(). Re-base.
v2: New.

--- a/xen/arch/x86/include/asm/pci.h
+++ b/xen/arch/x86/include/asm/pci.h
@@ -1,6 +1,8 @@
 #ifndef __X86_PCI_H__
 #define __X86_PCI_H__
 
+#include <xen/mm.h>
+
 #define CF8_BDF(cf8)     (  ((cf8) & 0x00ffff00) >> 8)
 #define CF8_ADDR_LO(cf8) (   (cf8) & 0x000000fc)
 #define CF8_ADDR_HI(cf8) (  ((cf8) & 0x0f000000) >> 16)
@@ -18,7 +20,18 @@ struct arch_pci_dev {
      * them don't race (de)initialization and hence don't strictly need any
      * locking.
      */
+    union {
+        /* Subset of struct arch_iommu's fields, to be used in dom_io. */
+        struct {
+            uint64_t pgd_maddr;
+        } vtd;
+        struct {
+            struct page_info *root_table;
+        } amd;
+    };
     domid_t pseudo_domid;
+    mfn_t leaf_mfn;
+    struct page_list_head pgtables_list;
 };
 
 int pci_conf_write_intercept(unsigned int seg, unsigned int bdf,
--- a/xen/drivers/passthrough/amd/iommu.h
+++ b/xen/drivers/passthrough/amd/iommu.h
@@ -237,7 +237,8 @@ int amd_iommu_init_late(void);
 int amd_iommu_update_ivrs_mapping_acpi(void);
 int cf_check iov_adjust_irq_affinities(void);
 
-int cf_check amd_iommu_quarantine_init(struct domain *d);
+int cf_check amd_iommu_quarantine_init(struct pci_dev *pdev, bool scratch_page);
+void amd_iommu_quarantine_teardown(struct pci_dev *pdev);
 
 /* mapping functions */
 int __must_check cf_check amd_iommu_map_page(
--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -600,64 +600,138 @@ int cf_check amd_iommu_get_reserved_devi
     return 0;
 }
 
-int __init cf_check amd_iommu_quarantine_init(struct domain *d)
+static int fill_qpt(union amd_iommu_pte *this, unsigned int level,
+                    struct page_info *pgs[IOMMU_MAX_PT_LEVELS])
 {
-    struct domain_iommu *hd = dom_iommu(d);
+    struct domain_iommu *hd = dom_iommu(dom_io);
+    unsigned int i;
+    int rc = 0;
+
+    for ( i = 0; !rc && i < PTE_PER_TABLE_SIZE; ++i )
+    {
+        union amd_iommu_pte *pte = &this[i], *next;
+
+        if ( !pte->pr )
+        {
+            if ( !pgs[level] )
+            {
+                /*
+                 * The pgtable allocator is fine for the leaf page, as well as
+                 * page table pages, and the resulting allocations are always
+                 * zeroed.
+                 */
+                pgs[level] = iommu_alloc_pgtable(hd);
+                if ( !pgs[level] )
+                {
+                    rc = -ENOMEM;
+                    break;
+                }
+
+                if ( level )
+                {
+                    next = __map_domain_page(pgs[level]);
+                    rc = fill_qpt(next, level - 1, pgs);
+                    unmap_domain_page(next);
+                }
+            }
+
+            /*
+             * PDEs are essentially a subset of PTEs, so this function
+             * is fine to use even at the leaf.
+             */
+            set_iommu_pde_present(pte, mfn_x(page_to_mfn(pgs[level])), level,
+                                  true, true);
+        }
+        else if ( level && pte->next_level )
+        {
+            next = map_domain_page(_mfn(pte->mfn));
+            rc = fill_qpt(next, level - 1, pgs);
+            unmap_domain_page(next);
+        }
+    }
+
+    return rc;
+}
+
+int cf_check amd_iommu_quarantine_init(struct pci_dev *pdev, bool scratch_page)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
     unsigned long end_gfn =
         1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT);
     unsigned int level = amd_iommu_get_paging_mode(end_gfn);
-    union amd_iommu_pte *table;
+    unsigned int req_id = get_dma_requestor_id(pdev->seg, pdev->sbdf.bdf);
+    const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg);
+    int rc;
+
+    ASSERT(pcidevs_locked());
+    ASSERT(!hd->arch.amd.root_table);
+    ASSERT(page_list_empty(&hd->arch.pgtables.list));
 
-    if ( hd->arch.amd.root_table )
-    {
-        ASSERT_UNREACHABLE();
+    if ( !scratch_page && !ivrs_mappings[req_id].unity_map )
         return 0;
-    }
 
-    spin_lock(&hd->arch.mapping_lock);
+    ASSERT(pdev->arch.pseudo_domid != DOMID_INVALID);
 
-    hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
-    if ( !hd->arch.amd.root_table )
-        goto out;
+    if ( pdev->arch.amd.root_table )
+    {
+        clear_domain_page(pdev->arch.leaf_mfn);
+        return 0;
+    }
 
-    table = __map_domain_page(hd->arch.amd.root_table);
-    while ( level )
+    pdev->arch.amd.root_table = iommu_alloc_pgtable(hd);
+    if ( !pdev->arch.amd.root_table )
+        return -ENOMEM;
+
+    /* Transiently install the root into DomIO, for iommu_identity_mapping(). */
+    hd->arch.amd.root_table = pdev->arch.amd.root_table;
+
+    rc = amd_iommu_reserve_domain_unity_map(dom_io,
+                                            ivrs_mappings[req_id].unity_map,
+                                            0);
+
+    iommu_identity_map_teardown(dom_io);
+    hd->arch.amd.root_table = NULL;
+
+    if ( rc )
+        AMD_IOMMU_WARN("%pp: quarantine unity mapping failed\n", &pdev->sbdf);
+    else if ( scratch_page )
     {
-        struct page_info *pg;
-        unsigned int i;
+        union amd_iommu_pte *root;
+        struct page_info *pgs[IOMMU_MAX_PT_LEVELS] = {};
 
-        /*
-         * The pgtable allocator is fine for the leaf page, as well as
-         * page table pages, and the resulting allocations are always
-         * zeroed.
-         */
-        pg = iommu_alloc_pgtable(hd);
-        if ( !pg )
-            break;
+        spin_lock(&hd->arch.mapping_lock);
 
-        for ( i = 0; i < PTE_PER_TABLE_SIZE; i++ )
-        {
-            union amd_iommu_pte *pde = &table[i];
+        root = __map_domain_page(pdev->arch.amd.root_table);
+        rc = fill_qpt(root, level - 1, pgs);
+        unmap_domain_page(root);
 
-            /*
-             * PDEs are essentially a subset of PTEs, so this function
-             * is fine to use even at the leaf.
-             */
-            set_iommu_pde_present(pde, mfn_x(page_to_mfn(pg)), level - 1,
-                                  false, true);
-        }
+        pdev->arch.leaf_mfn = page_to_mfn(pgs[0]);
 
-        unmap_domain_page(table);
-        table = __map_domain_page(pg);
-        level--;
+        spin_unlock(&hd->arch.mapping_lock);
     }
-    unmap_domain_page(table);
 
- out:
-    spin_unlock(&hd->arch.mapping_lock);
+    page_list_move(&pdev->arch.pgtables_list, &hd->arch.pgtables.list);
+
+    if ( rc )
+        amd_iommu_quarantine_teardown(pdev);
+
+    return rc;
+}
+
+void amd_iommu_quarantine_teardown(struct pci_dev *pdev)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
+
+    ASSERT(pcidevs_locked());
+
+    if ( !pdev->arch.amd.root_table )
+        return;
 
-    /* Pages leaked in failure case */
-    return level ? -ENOMEM : 0;
+    ASSERT(page_list_empty(&hd->arch.pgtables.list));
+    page_list_move(&hd->arch.pgtables.list, &pdev->arch.pgtables_list);
+    while ( iommu_free_pgtables(dom_io) == -ERESTART )
+        /* nothing */;
+    pdev->arch.amd.root_table = NULL;
 }
 
 /*
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -26,7 +26,7 @@
 #include "../ats.h"
 
 /* dom_io is used as a sentinel for quarantined devices */
-#define QUARANTINE_SKIP(d) ((d) == dom_io && !dom_iommu(d)->arch.amd.root_table)
+#define QUARANTINE_SKIP(d, p) ((d) == dom_io && !(p)->arch.amd.root_table)
 
 static bool_t __read_mostly init_done;
 
@@ -125,8 +125,10 @@ static int __must_check amd_iommu_setup_
     u8 bus = pdev->bus;
     struct domain_iommu *hd = dom_iommu(domain);
     const struct ivrs_mappings *ivrs_dev;
+    const struct page_info *root_pg;
+    domid_t domid;
 
-    if ( QUARANTINE_SKIP(domain) )
+    if ( QUARANTINE_SKIP(domain, pdev) )
         return 0;
 
     BUG_ON(!hd->arch.amd.paging_mode || !iommu->dev_table.buffer);
@@ -147,14 +149,25 @@ static int __must_check amd_iommu_setup_
     dte = &table[req_id];
     ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id];
 
+    if ( domain != dom_io )
+    {
+        root_pg = hd->arch.amd.root_table;
+        domid = domain->domain_id;
+    }
+    else
+    {
+        root_pg = pdev->arch.amd.root_table;
+        domid = pdev->arch.pseudo_domid;
+    }
+
     spin_lock_irqsave(&iommu->lock, flags);
 
     if ( !dte->v || !dte->tv )
     {
         /* bind DTE to domain page-tables */
         rc = amd_iommu_set_root_page_table(
-                 dte, page_to_maddr(hd->arch.amd.root_table),
-                 domain->domain_id, hd->arch.amd.paging_mode, sr_flags);
+                 dte, page_to_maddr(root_pg), domid,
+                 hd->arch.amd.paging_mode, sr_flags);
         if ( rc )
         {
             ASSERT(rc < 0);
@@ -181,7 +194,7 @@ static int __must_check amd_iommu_setup_
 
         amd_iommu_flush_device(iommu, req_id);
     }
-    else if ( dte->pt_root != mfn_x(page_to_mfn(hd->arch.amd.root_table)) )
+    else if ( dte->pt_root != mfn_x(page_to_mfn(root_pg)) )
     {
         /*
          * Strictly speaking if the device is the only one with this requestor
@@ -194,8 +207,8 @@ static int __must_check amd_iommu_setup_
             rc = -EOPNOTSUPP;
         else
             rc = amd_iommu_set_root_page_table(
-                     dte, page_to_maddr(hd->arch.amd.root_table),
-                     domain->domain_id, hd->arch.amd.paging_mode, sr_flags);
+                     dte, page_to_maddr(root_pg), domid,
+                     hd->arch.amd.paging_mode, sr_flags);
         if ( rc < 0 )
         {
             spin_unlock_irqrestore(&iommu->lock, flags);
@@ -214,6 +227,7 @@ static int __must_check amd_iommu_setup_
               * intended anyway.
               */
              !pdev->domain->is_dying &&
+             pdev->domain != dom_io &&
              (any_pdev_behind_iommu(pdev->domain, pdev, iommu) ||
               pdev->phantom_stride) )
             AMD_IOMMU_WARN(" %pp: reassignment may cause %pd data corruption\n",
@@ -246,9 +260,8 @@ static int __must_check amd_iommu_setup_
     AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, "
                     "root table = %#"PRIx64", "
                     "domain = %d, paging mode = %d\n",
-                    req_id, pdev->type,
-                    page_to_maddr(hd->arch.amd.root_table),
-                    domain->domain_id, hd->arch.amd.paging_mode);
+                    req_id, pdev->type, page_to_maddr(root_pg),
+                    domid, hd->arch.amd.paging_mode);
 
     ASSERT(pcidevs_locked());
 
@@ -327,7 +340,7 @@ int amd_iommu_alloc_root(struct domain *
 {
     struct domain_iommu *hd = dom_iommu(d);
 
-    if ( unlikely(!hd->arch.amd.root_table) )
+    if ( unlikely(!hd->arch.amd.root_table) && d != dom_io )
     {
         hd->arch.amd.root_table = iommu_alloc_pgtable(hd);
         if ( !hd->arch.amd.root_table )
@@ -391,7 +404,7 @@ static void amd_iommu_disable_domain_dev
     int req_id;
     u8 bus = pdev->bus;
 
-    if ( QUARANTINE_SKIP(domain) )
+    if ( QUARANTINE_SKIP(domain, pdev) )
         return;
 
     ASSERT(pcidevs_locked());
@@ -430,7 +443,7 @@ static void amd_iommu_disable_domain_dev
 
         AMD_IOMMU_DEBUG("Disable: device id = %#x, "
                         "domain = %d, paging mode = %d\n",
-                        req_id,  domain->domain_id,
+                        req_id, dte->domain_id,
                         dom_iommu(domain)->arch.amd.paging_mode);
     }
     else
@@ -454,7 +467,7 @@ static int cf_check reassign_device(
         return -ENODEV;
     }
 
-    if ( !QUARANTINE_SKIP(target) )
+    if ( !QUARANTINE_SKIP(target, pdev) )
     {
         rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
         if ( rc )
@@ -655,6 +668,8 @@ static int cf_check amd_iommu_remove_dev
         AMD_IOMMU_WARN("%pd: unity unmapping failed for %pp\n",
                        pdev->domain, &pdev->sbdf);
 
+    amd_iommu_quarantine_teardown(pdev);
+
     iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map);
     pdev->arch.pseudo_domid = DOMID_INVALID;
 
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -466,21 +466,22 @@ int iommu_iotlb_flush_all(struct domain
     return rc;
 }
 
-static int __init iommu_quarantine_init(void)
+int iommu_quarantine_dev_init(device_t *dev)
 {
     const struct domain_iommu *hd = dom_iommu(dom_io);
-    int rc;
 
-    dom_io->options |= XEN_DOMCTL_CDF_iommu;
+    if ( !iommu_quarantine || !hd->platform_ops->quarantine_init )
+        return 0;
 
-    rc = iommu_domain_init(dom_io, 0);
-    if ( rc || iommu_quarantine < IOMMU_quarantine_scratch_page )
-        return rc;
+    return iommu_call(hd->platform_ops, quarantine_init,
+                      dev, iommu_quarantine == IOMMU_quarantine_scratch_page);
+}
 
-    if ( !hd->platform_ops->quarantine_init )
-        return 0;
+static int __init iommu_quarantine_init(void)
+{
+    dom_io->options |= XEN_DOMCTL_CDF_iommu;
 
-    return hd->platform_ops->quarantine_init(dom_io);
+    return iommu_domain_init(dom_io, 0);
 }
 
 int __init iommu_setup(void)
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -921,9 +921,16 @@ static int deassign_device(struct domain
         return -ENODEV;
 
     /* De-assignment from dom_io should de-quarantine the device */
-    target = ((pdev->quarantine || iommu_quarantine) &&
-              pdev->domain != dom_io) ?
-        dom_io : hardware_domain;
+    if ( (pdev->quarantine || iommu_quarantine) && pdev->domain != dom_io )
+    {
+        ret = iommu_quarantine_dev_init(pci_to_dev(pdev));
+        if ( ret )
+           return ret;
+
+        target = dom_io;
+    }
+    else
+        target = hardware_domain;
 
     while ( pdev->phantom_stride )
     {
@@ -1498,6 +1505,13 @@ static int assign_device(struct domain *
     if ( rc )
         goto done;
 
+    if ( pdev->domain != dom_io )
+    {
+        rc = iommu_quarantine_dev_init(pci_to_dev(pdev));
+        if ( rc )
+            goto done;
+    }
+
     pdev->fault.count = 0;
 
     if ( (rc = iommu_call(hd->platform_ops, assign_device, d, devfn,
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -45,6 +45,11 @@
 
 /* dom_io is used as a sentinel for quarantined devices */
 #define QUARANTINE_SKIP(d, pgd_maddr) ((d) == dom_io && !(pgd_maddr))
+#define DEVICE_DOMID(d, pdev) ((d) != dom_io ? (d)->domain_id \
+                                             : (pdev)->arch.pseudo_domid)
+#define DEVICE_PGTABLE(d, pdev) ((d) != dom_io \
+                                 ? dom_iommu(d)->arch.vtd.pgd_maddr \
+                                 : (pdev)->arch.vtd.pgd_maddr)
 
 /* Possible unfiltered LAPIC/MSI messages from untrusted sources? */
 bool __read_mostly untrusted_msi;
@@ -108,12 +113,17 @@ static int get_iommu_did(domid_t domid,
 
 #define DID_FIELD_WIDTH 16
 #define DID_HIGH_OFFSET 8
+
+/*
+ * This function may have "context" passed as NULL, to merely obtain a DID
+ * for "domid".
+ */
 static int context_set_domain_id(struct context_entry *context,
                                  domid_t domid, struct vtd_iommu *iommu)
 {
     unsigned int i;
 
-    ASSERT(spin_is_locked(&iommu->lock));
+    ASSERT(pcidevs_locked());
 
     if ( domid_mapping(iommu) )
     {
@@ -138,8 +148,12 @@ static int context_set_domain_id(struct
     else
         i = convert_domid(iommu, domid);
 
-    context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
-    context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
+    if ( context )
+    {
+        context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET);
+        context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET;
+    }
+
     return 0;
 }
 
@@ -194,8 +208,12 @@ static void check_cleanup_domid_map(cons
                                     const struct pci_dev *exclude,
                                     struct vtd_iommu *iommu)
 {
-    bool found = any_pdev_behind_iommu(d, exclude, iommu);
+    bool found;
+
+    if ( d == dom_io )
+        return;
 
+    found = any_pdev_behind_iommu(d, exclude, iommu);
     /*
      * Hidden devices are associated with DomXEN but usable by the hardware
      * domain. Hence they need considering here as well.
@@ -212,7 +230,7 @@ static void check_cleanup_domid_map(cons
 
 domid_t did_to_domain_id(const struct vtd_iommu *iommu, unsigned int did)
 {
-    if ( did >= min(cap_ndoms(iommu->cap), DOMID_MASK + 1) )
+    if ( did >= cap_ndoms(iommu->cap) )
         return DOMID_INVALID;
 
     if ( !domid_mapping(iommu) )
@@ -1428,7 +1446,7 @@ int domain_context_mapping_one(
         domid = did_to_domain_id(iommu, prev_did);
         if ( domid < DOMID_FIRST_RESERVED )
             prev_dom = rcu_lock_domain_by_id(domid);
-        else if ( domid == DOMID_IO )
+        else if ( pdev ? domid == pdev->arch.pseudo_domid : domid > DOMID_MASK )
             prev_dom = rcu_lock_domain(dom_io);
         if ( !prev_dom )
         {
@@ -1585,15 +1603,12 @@ int domain_context_mapping_one(
     {
         if ( !prev_dom )
             ret = domain_context_unmap_one(domain, iommu, bus, devfn,
-                                           domain->domain_id);
+                                           DEVICE_DOMID(domain, pdev));
         else if ( prev_dom != domain ) /* Avoid infinite recursion. */
-        {
-            hd = dom_iommu(prev_dom);
             ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev,
-                                             domain->domain_id,
-                                             hd->arch.vtd.pgd_maddr,
+                                             DEVICE_DOMID(prev_dom, pdev),
+                                             DEVICE_PGTABLE(prev_dom, pdev),
                                              mode & MAP_WITH_RMRR) < 0;
-        }
         else
             ret = 1;
 
@@ -1615,7 +1630,7 @@ static int domain_context_mapping(struct
 {
     const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
     const struct acpi_rmrr_unit *rmrr;
-    paddr_t pgd_maddr = dom_iommu(domain)->arch.vtd.pgd_maddr;
+    paddr_t pgd_maddr = DEVICE_PGTABLE(domain, pdev);
     domid_t orig_domid = pdev->arch.pseudo_domid;
     int ret = 0;
     unsigned int i, mode = 0;
@@ -1644,7 +1659,7 @@ static int domain_context_mapping(struct
         break;
     }
 
-    if ( domain != pdev->domain )
+    if ( domain != pdev->domain && pdev->domain != dom_io )
     {
         if ( pdev->domain->is_dying )
             mode |= MAP_OWNER_DYING;
@@ -1686,8 +1701,8 @@ static int domain_context_mapping(struct
         if ( iommu_debug )
             printk(VTDPREFIX "%pd:PCIe: map %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
-        ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, domain->domain_id, pgd_maddr,
+        ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, pdev,
+                                         DEVICE_DOMID(domain, pdev), pgd_maddr,
                                          mode);
         if ( ret > 0 )
             ret = 0;
@@ -1713,8 +1728,8 @@ static int domain_context_mapping(struct
                    domain, &PCI_SBDF3(seg, bus, devfn));
 
         ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                         pdev, domain->domain_id, pgd_maddr,
-                                         mode);
+                                         pdev, DEVICE_DOMID(domain, pdev),
+                                         pgd_maddr, mode);
         if ( ret < 0 )
             break;
         prev_present = ret;
@@ -1742,8 +1757,8 @@ static int domain_context_mapping(struct
          */
         if ( ret >= 0 )
             ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn,
-                                             NULL, domain->domain_id, pgd_maddr,
-                                             mode);
+                                             NULL, DEVICE_DOMID(domain, pdev),
+                                             pgd_maddr, mode);
 
         /*
          * Devices behind PCIe-to-PCI/PCIx bridge may generate different
@@ -1758,8 +1773,8 @@ static int domain_context_mapping(struct
         if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE &&
              (secbus != pdev->bus || pdev->devfn != 0) )
             ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0,
-                                             NULL, domain->domain_id, pgd_maddr,
-                                             mode);
+                                             NULL, DEVICE_DOMID(domain, pdev),
+                                             pgd_maddr, mode);
 
         if ( ret )
         {
@@ -1801,9 +1816,6 @@ int domain_context_unmap_one(
     int iommu_domid, rc, ret;
     bool_t flush_dev_iotlb;
 
-    if ( QUARANTINE_SKIP(domain, dom_iommu(domain)->arch.vtd.pgd_maddr) )
-        return 0;
-
     ASSERT(pcidevs_locked());
     spin_lock(&iommu->lock);
 
@@ -1905,7 +1917,7 @@ static const struct acpi_drhd_unit *doma
             printk(VTDPREFIX "%pd:PCIe: unmap %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
         ret = domain_context_unmap_one(domain, iommu, bus, devfn,
-                                       domain->domain_id);
+                                       DEVICE_DOMID(domain, pdev));
         if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 )
             disable_ats_device(pdev);
 
@@ -1919,7 +1931,7 @@ static const struct acpi_drhd_unit *doma
             printk(VTDPREFIX "%pd:PCI: unmap %pp\n",
                    domain, &PCI_SBDF3(seg, bus, devfn));
         ret = domain_context_unmap_one(domain, iommu, bus, devfn,
-                                       domain->domain_id);
+                                       DEVICE_DOMID(domain, pdev));
         if ( ret )
             break;
 
@@ -1942,18 +1954,12 @@ static const struct acpi_drhd_unit *doma
             break;
         }
 
+        ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
+                                       DEVICE_DOMID(domain, pdev));
         /* PCIe to PCI/PCIx bridge */
-        if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
-        {
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
-                                           domain->domain_id);
-            if ( !ret )
-                ret = domain_context_unmap_one(domain, iommu, secbus, 0,
-                                               domain->domain_id);
-        }
-        else /* Legacy PCI bridge */
-            ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn,
-                                           domain->domain_id);
+        if ( !ret && pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE )
+            ret = domain_context_unmap_one(domain, iommu, secbus, 0,
+                                           DEVICE_DOMID(domain, pdev));
 
         break;
 
@@ -1964,7 +1970,7 @@ static const struct acpi_drhd_unit *doma
     }
 
     if ( !ret && pdev->devfn == devfn &&
-         !QUARANTINE_SKIP(domain, dom_iommu(domain)->arch.vtd.pgd_maddr) )
+         !QUARANTINE_SKIP(domain, pdev->arch.vtd.pgd_maddr) )
         check_cleanup_domid_map(domain, pdev, iommu);
 
     return drhd;
@@ -1997,6 +2003,26 @@ static void cf_check iommu_domain_teardo
     XFREE(hd->arch.vtd.iommu_bitmap);
 }
 
+static void quarantine_teardown(struct pci_dev *pdev,
+                                const struct acpi_drhd_unit *drhd)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
+
+    ASSERT(pcidevs_locked());
+
+    if ( !pdev->arch.vtd.pgd_maddr )
+        return;
+
+    ASSERT(page_list_empty(&hd->arch.pgtables.list));
+    page_list_move(&hd->arch.pgtables.list, &pdev->arch.pgtables_list);
+    while ( iommu_free_pgtables(dom_io) == -ERESTART )
+        /* nothing */;
+    pdev->arch.vtd.pgd_maddr = 0;
+
+    if ( drhd )
+        cleanup_domid_map(pdev->arch.pseudo_domid, drhd->iommu);
+}
+
 static int __must_check cf_check intel_iommu_map_page(
     struct domain *d, dfn_t dfn, mfn_t mfn, unsigned int flags,
     unsigned int *flush_flags)
@@ -2221,6 +2247,8 @@ static int cf_check intel_iommu_remove_d
                                rmrr->end_address, 0);
     }
 
+    quarantine_teardown(pdev, drhd);
+
     if ( drhd )
     {
         iommu_free_domid(pdev->arch.pseudo_domid,
@@ -2579,7 +2607,7 @@ static int cf_check reassign_device_owne
 {
     int ret;
 
-    if ( !QUARANTINE_SKIP(target, dom_iommu(target)->arch.vtd.pgd_maddr) )
+    if ( !QUARANTINE_SKIP(target, pdev->arch.vtd.pgd_maddr) )
     {
         if ( !has_arch_pdevs(target) )
             vmx_pi_hooks_assign(target);
@@ -2596,7 +2624,7 @@ static int cf_check reassign_device_owne
         ret = domain_context_mapping(target, devfn, pdev);
 
         if ( !ret && pdev->devfn == devfn &&
-             !QUARANTINE_SKIP(source, dom_iommu(source)->arch.vtd.pgd_maddr) )
+             !QUARANTINE_SKIP(source, pdev->arch.vtd.pgd_maddr) )
         {
             const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev);
 
@@ -2918,69 +2946,136 @@ static void cf_check vtd_dump_page_table
                               agaw_to_level(hd->arch.vtd.agaw), 0, 0);
 }
 
-static int __init cf_check intel_iommu_quarantine_init(struct domain *d)
+static int fill_qpt(struct dma_pte *this, unsigned int level,
+                    struct page_info *pgs[6])
 {
-    struct domain_iommu *hd = dom_iommu(d);
+    struct domain_iommu *hd = dom_iommu(dom_io);
+    unsigned int i;
+    int rc = 0;
+
+    for ( i = 0; !rc && i < PTE_NUM; ++i )
+    {
+        struct dma_pte *pte = &this[i], *next;
+
+        if ( !dma_pte_present(*pte) )
+        {
+            if ( !pgs[level] )
+            {
+                /*
+                 * The pgtable allocator is fine for the leaf page, as well as
+                 * page table pages, and the resulting allocations are always
+                 * zeroed.
+                 */
+                pgs[level] = iommu_alloc_pgtable(hd);
+                if ( !pgs[level] )
+                {
+                    rc = -ENOMEM;
+                    break;
+                }
+
+                if ( level )
+                {
+                    next = map_vtd_domain_page(page_to_maddr(pgs[level]));
+                    rc = fill_qpt(next, level - 1, pgs);
+                    unmap_vtd_domain_page(next);
+                }
+            }
+
+            dma_set_pte_addr(*pte, page_to_maddr(pgs[level]));
+            dma_set_pte_readable(*pte);
+            dma_set_pte_writable(*pte);
+        }
+        else if ( level && !dma_pte_superpage(*pte) )
+        {
+            next = map_vtd_domain_page(dma_pte_addr(*pte));
+            rc = fill_qpt(next, level - 1, pgs);
+            unmap_vtd_domain_page(next);
+        }
+    }
+
+    return rc;
+}
+
+static int cf_check intel_iommu_quarantine_init(struct pci_dev *pdev,
+                                                bool scratch_page)
+{
+    struct domain_iommu *hd = dom_iommu(dom_io);
     struct page_info *pg;
-    struct dma_pte *parent;
     unsigned int agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH);
     unsigned int level = agaw_to_level(agaw);
-    int rc = 0;
+    const struct acpi_drhd_unit *drhd;
+    const struct acpi_rmrr_unit *rmrr;
+    unsigned int i, bdf;
+    bool rmrr_found = false;
+    int rc;
 
-    spin_lock(&hd->arch.mapping_lock);
+    ASSERT(pcidevs_locked());
+    ASSERT(!hd->arch.vtd.pgd_maddr);
+    ASSERT(page_list_empty(&hd->arch.pgtables.list));
 
-    if ( hd->arch.vtd.pgd_maddr )
+    if ( pdev->arch.vtd.pgd_maddr )
     {
-        ASSERT_UNREACHABLE();
-        goto out;
+        clear_domain_page(pdev->arch.leaf_mfn);
+        return 0;
     }
 
-    pg = iommu_alloc_pgtable(hd);
+    drhd = acpi_find_matched_drhd_unit(pdev);
+    if ( !drhd )
+        return -ENODEV;
 
-    rc = -ENOMEM;
+    pg = iommu_alloc_pgtable(hd);
     if ( !pg )
-        goto out;
+        return -ENOMEM;
+
+    rc = context_set_domain_id(NULL, pdev->arch.pseudo_domid, drhd->iommu);
 
+    /* Transiently install the root into DomIO, for iommu_identity_mapping(). */
     hd->arch.vtd.pgd_maddr = page_to_maddr(pg);
 
-    parent = map_vtd_domain_page(hd->arch.vtd.pgd_maddr);
-    while ( level )
+    for_each_rmrr_device ( rmrr, bdf, i )
     {
-        uint64_t maddr;
-        unsigned int offset;
-
-        /*
-         * The pgtable allocator is fine for the leaf page, as well as
-         * page table pages, and the resulting allocations are always
-         * zeroed.
-         */
-        pg = iommu_alloc_pgtable(hd);
-
-        if ( !pg )
-            goto out;
+        if ( rc )
+            break;
 
-        maddr = page_to_maddr(pg);
-        for ( offset = 0; offset < PTE_NUM; offset++ )
+        if ( rmrr->segment == pdev->seg && bdf == pdev->sbdf.bdf )
         {
-            struct dma_pte *pte = &parent[offset];
+            rmrr_found = true;
 
-            dma_set_pte_addr(*pte, maddr);
-            dma_set_pte_readable(*pte);
+            rc = iommu_identity_mapping(dom_io, p2m_access_rw,
+                                        rmrr->base_address, rmrr->end_address,
+                                        0);
+            if ( rc )
+                printk(XENLOG_ERR VTDPREFIX
+                       "%pp: RMRR quarantine mapping failed\n",
+                       &pdev->sbdf);
         }
-        iommu_sync_cache(parent, PAGE_SIZE);
+    }
 
-        unmap_vtd_domain_page(parent);
-        parent = map_vtd_domain_page(maddr);
-        level--;
+    iommu_identity_map_teardown(dom_io);
+    hd->arch.vtd.pgd_maddr = 0;
+    pdev->arch.vtd.pgd_maddr = page_to_maddr(pg);
+
+    if ( !rc && scratch_page )
+    {
+        struct dma_pte *root;
+        struct page_info *pgs[6] = {};
+
+        spin_lock(&hd->arch.mapping_lock);
+
+        root = map_vtd_domain_page(pdev->arch.vtd.pgd_maddr);
+        rc = fill_qpt(root, level - 1, pgs);
+        unmap_vtd_domain_page(root);
+
+        pdev->arch.leaf_mfn = page_to_mfn(pgs[0]);
+
+        spin_unlock(&hd->arch.mapping_lock);
     }
-    unmap_vtd_domain_page(parent);
 
-    rc = 0;
+    page_list_move(&pdev->arch.pgtables_list, &hd->arch.pgtables.list);
 
- out:
-    spin_unlock(&hd->arch.mapping_lock);
+    if ( rc || (!scratch_page && !rmrr_found) )
+        quarantine_teardown(pdev, drhd);
 
-    /* Pages may be leaked in failure case */
     return rc;
 }
 
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -477,7 +477,7 @@ struct vtd_iommu {
     u32 nr_pt_levels;
     u64	cap;
     u64	ecap;
-    spinlock_t lock; /* protect context, domain ids */
+    spinlock_t lock; /* protect context */
     spinlock_t register_lock; /* protect iommu register handling */
     u64 root_maddr; /* root entry machine address */
     nodeid_t node;
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -233,7 +233,7 @@ typedef int iommu_grdm_t(xen_pfn_t start
 struct iommu_ops {
     int (*init)(struct domain *d);
     void (*hwdom_init)(struct domain *d);
-    int (*quarantine_init)(struct domain *d);
+    int (*quarantine_init)(device_t *dev, bool scratch_page);
     int (*add_device)(u8 devfn, device_t *dev);
     int (*enable_device)(device_t *dev);
     int (*remove_device)(u8 devfn, device_t *dev);
@@ -349,6 +349,7 @@ int __must_check iommu_suspend(void);
 void iommu_resume(void);
 void iommu_crash_shutdown(void);
 int iommu_get_reserved_device_memory(iommu_grdm_t *, void *);
+int iommu_quarantine_dev_init(device_t *dev);
 
 #ifdef CONFIG_HAS_PCI
 int iommu_do_pci_domctl(struct xen_domctl *, struct domain *d,