Xen Security Advisory 410 v3 (CVE-2022-33746) - P2M pool freeing may take excessively long

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

            Xen Security Advisory CVE-2022-33746 / XSA-410
                               version 3

              P2M pool freeing may take excessively long

UPDATES IN VERSION 3
====================

Public release.

ISSUE DESCRIPTION
=================

The P2M pool backing second level address translation for guests may be
of significant size.  Therefore its freeing may take more time than is
reasonable without intermediate preemption checks.  Such checking for
the need to preempt was so far missing.
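
The attached patches address this by freeing the pool in batches and
periodically checking whether preemption is needed.  The following is
only an illustrative sketch of that pattern (the function name and
batch size are simplified assumptions; the authoritative changes are
in the patches listed under RESOLUTION below):

    /*
     * Sketch: drain a P2M page pool while allowing preemption.  Not the
     * literal patched code - see the attached patches for the real hunks.
     */
    static int free_p2m_pool(struct p2m_domain *p2m)
    {
        struct page_info *pg;
        unsigned long count = 0;

        while ( (pg = page_list_remove_head(&p2m->pages)) )
        {
            free_domheap_page(pg);
            /* Without such a check, a large pool ties up the CPU. */
            if ( !(++count % 512) && hypercall_preempt_check() )
                return -ERESTART;   /* caller restarts the operation later */
        }

        return 0;
    }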

IMPACT
======

A group of collaborating guests can cause the temporary locking up of a
CPU, potentially leading to a Denial of Service (DoS) affecting the
entire host.

VULNERABLE SYSTEMS
==================

All Xen versions are vulnerable.

x86 HVM and PVH guests as well as Arm guests can trigger the
vulnerability.  x86 PV guests cannot trigger the vulnerability.

MITIGATION
==========

Running only PV guests will avoid the vulnerability.

CREDITS
=======

This issue was discovered by Julien Grall of Amazon.

RESOLUTION
==========

Applying the appropriate set of attached patches resolves this issue.

Note that patches for released versions are generally prepared to
apply to the stable branches, and may not apply cleanly to the most
recent release tarball.  Downstreams are encouraged to update to the
tip of the stable branch before applying these patches.

xsa410/xsa410-??.patch           xen-unstable
xsa410/xsa410-4.16-??.patch      Xen 4.16.x - 4.15.x
xsa410/xsa410-4.14-??.patch      Xen 4.14.x
xsa410/xsa410-4.13-??.patch      Xen 4.13.x

$ sha256sum xsa410* xsa410*/*
70b2f2c880b30094c9bdbd3ae4b20b32acfc8daf94d5add5884998ff20ffc0e7  xsa410.meta
632f4d71bc9dfc5ddcf649b1e484a918b4cb3d270dedad3b904bf4552318ae0d  xsa410/xsa410-01.patch
a2c1e6871a76b9d0c7f54b5557c6d0e1a02423bca5b27354aa7e872b0016047e  xsa410/xsa410-02.patch
61b8c71ad199dfa9762e739a592aa0a7f3b79d42e88d80a9589a993c768352be  xsa410/xsa410-03.patch
fb11b3d730bb665add2447b8f2258755604ce51e0ccc0731cddd938a538b051f  xsa410/xsa410-4.13-01.patch
ce5e780fdd162a1961fb0d51ccd7db8c3b2cedcee444ee3a58569bd8bbcfd6e8  xsa410/xsa410-4.13-02.patch
33514a6bf40d6c73fa7ca064b3e0401048f87eecbd007601bca6943b58f5c4b5  xsa410/xsa410-4.13-03.patch
af7d5eeda27e789c91e39b58110b25b668ecc241ed87bf4d75d9ff2bf647c660  xsa410/xsa410-4.13-04.patch
972e95787d635056bb0476bff990af0957d9669b4b4948975a74ed085b9fdc38  xsa410/xsa410-4.13-05.patch
4587ff1246f1ea59053e76cdded0e42aba8e747123c8b37b7fe4e03f39d3a447  xsa410/xsa410-4.13-06.patch
99a2a83ea89aa0a79c3cd938917d6b7de1e7e52ec744fb2e0ed1ed2a577cb203  xsa410/xsa410-4.13-07.patch
b36cc0d96111dbf65b7fefbce5fe9c5fe737dca24453f10f76253ce5bdcbb37d  xsa410/xsa410-4.13-08.patch
b548a1ba8082e5dbb35943bbacc5391766343c373c6edd2eb96d430cacdac00b  xsa410/xsa410-4.13-09.patch
9fae7cf66cb298737ad5f021c349291ec84f8de83d02a9b814967fb97b85ad1f  xsa410/xsa410-4.13-10.patch
0b91fcfc0a29428cfc06f4f1ddb01f5d1e7f144eae05635f2e9ef46dd7b33f0a  xsa410/xsa410-4.14-01.patch
a7a7e7e9529e91454035ad468c46faae34638be1f5f0694e1fe352c6c1acff06  xsa410/xsa410-4.14-02.patch
75bb2296a9f8adeb0ae3fc330f158614aab94a9263aba99730fe31d71be93d62  xsa410/xsa410-4.14-03.patch
8ad3dc1957fdb440e0bbd3b8e17286361ddfa6bb748ba6d48cc85ca8e88862ba  xsa410/xsa410-4.14-04.patch
5aba547158d8f182eb8a148a03c3c69741d264b568a80b349c34b99e36e75647  xsa410/xsa410-4.14-05.patch
5b343f47ce34c53a0cf300a05ccd6898f695e62ced4b0f14d64c9947c8c17250  xsa410/xsa410-4.14-06.patch
d34f3107061f13fdd1338d78544584d3509f8f7dabde78027f308c934cfeeb10  xsa410/xsa410-4.14-07.patch
8ccce0e109f6e0957643a04c822b7637b2cc7094ab73c4b19898657c05282f76  xsa410/xsa410-4.14-08.patch
ca3116eb10b4ea29a4e5ce97a40d0f504418a8cd890fa49fb4ddf6c3acba9a9b  xsa410/xsa410-4.14-09.patch
ec1ad7529e6406f7fff9ebe35caf64419e360feadc9fae4ea679bff88238eefa  xsa410/xsa410-4.14-10.patch
27857174e10917e02c6b9c6b8c29d5510c308035462a9a18bcdfebcef8c1e7af  xsa410/xsa410-4.16-01.patch
7fc330e398e99023f9875004409ae4cb3943b15338662c242887f593d909e271  xsa410/xsa410-4.16-02.patch
9a72aaef6a65ec984022590c5e1bb39527873df4607604746d0a0b91636271d8  xsa410/xsa410-4.16-03.patch
4dffbb2e5933c18426e6ce0cbba94c42637f59b8cec03aad2bfc54d81c49d3e3  xsa410/xsa410-4.16-04.patch
2e5d91e3e5e0e7a294caada1399e017487063642bbb42bddfa5169db6faab37e  xsa410/xsa410-4.16-05.patch
8174d9ed5f633f5a043084bf0cfb08211173f1afbfc5240c306bffa69c883595  xsa410/xsa410-4.16-06.patch
b78792bd0d51a8e18d570d225df556f2099272cab00f1cb95bbbb4c08d299ce1  xsa410/xsa410-4.16-07.patch
1f3f14bf3091e685cf6ac530baf7bd060586cf3db330ba1218d1048eb672d6eb  xsa410/xsa410-4.16-08.patch
63af35d559156436276967c94b3402982914b0fdd77187ff5b0bbf3dda356589  xsa410/xsa410-4.16-09.patch
85e8da807225df97583f5331491f29ecea059ce770c59a1a898a4b19b838f0c1  xsa410/xsa410-4.16-10.patch
6cf86d574ff45719659ed23af352fdc64d6563434057b733ac46ec6d5c758a3f  xsa410/xsa410-04.patch
296d38e69eebab2985cdab70419ca5fd73380d94b35c96fa7f6820fead59bf95  xsa410/xsa410-05.patch
e590762c70faad493b4e95c9f747ad9c3b313233f1b0aba3e81df5f40565cc51  xsa410/xsa410-06.patch
28164010d988fb590c7b22ef7f3571142660ec975ee8709f28fe310f220f7b08  xsa410/xsa410-07.patch
0ad43b452e5aef2657f311b6fa2fbc1eb07702d08c78878b1e614c573606feeb  xsa410/xsa410-08.patch
04f02d9b06f74a8921557196b39c2cf3dd8fd7bf0c1f350d0c55d8d49187e9a7  xsa410/xsa410-09.patch
a67ae39583867ed5d3900c4b45e2e32e9ac4ec58298c6508cedb273e9b7caf4b  xsa410/xsa410-10.patch
$

DEPLOYMENT DURING EMBARGO
=========================

Deployment of the patches and/or mitigations described above (or
others which are substantially similar) is permitted during the
embargo, even on public-facing systems with untrusted guest users and
administrators.

But: Distribution of updated software is prohibited (except to other
members of the predisclosure list).

Predisclosure list members who wish to deploy significantly different
patches and/or mitigations, please contact the Xen Project Security
Team.

(Note: this during-embargo deployment notice is retained in
post-embargo publicly released Xen Project advisories, even though it
is then no longer applicable.  This is to enable the community to have
oversight of the Xen Project Security Team's decisionmaking.)

For more information about permissible uses of embargoed information,
consult the Xen Project community's agreed Security Policy:
  http://www.xenproject.org/security-policy.html
-----BEGIN PGP SIGNATURE-----

iQFABAEBCAAqFiEEI+MiLBRfRHX6gGCng/4UyVfoK9kFAmNFS/4MHHBncEB4ZW4u
b3JnAAoJEIP+FMlX6CvZFn8H/AlU50r9Lk0QaxVbvuKVir3rVgP+QURgVeHMTcuj
UbNpjasPjQMbT9vzTPtIN+b59J0FwhWWZRIcZhYX6sPC/L9eAomUiFnVOe9Jmyec
cv0gpn/fWum850A9/cZ+F3wNNmgbHcm+uLvCWM11vO79kUMzKmCeDGguU5cgbmBo
hiNNL/mUEnu5QQn+jXolFCCA+CzlSJLg+tJwZn0il6dIf7z9d2yAxJRMUHF8s/c3
d23+6kTxLkfdnkGuwxkEVcSCaBN6YCGPaUy4AaQYzqPun/hcqGCsXCgK7X+iJIxq
36LWZLuqwAL80CQzEnMkgBNpqyQiudEwbZnBSMt0nzctg1g=
=EdsG
-----END PGP SIGNATURE-----
{
  "XSA": 410,
  "SupportedVersions": [
    "master",
    "4.16",
    "4.15",
    "4.14",
    "4.13"
  ],
  "Trees": [
    "xen"
  ],
  "Recipes": {
    "4.13": {
      "Recipes": {
        "xen": {
          "StableRef": "d8a693019845caa4e216bcac10f9501a814c99ae",
          "Prereqs": [],
          "Patches": [
            "xsa410/xsa410-4.13-*.patch"
          ]
        }
      }
    },
    "4.14": {
      "Recipes": {
        "xen": {
          "StableRef": "261b882f7704515a01f74589f57f0c1303e3b701",
          "Prereqs": [],
          "Patches": [
            "xsa410/xsa410-4.14-*.patch"
          ]
        }
      }
    },
    "4.15": {
      "Recipes": {
        "xen": {
          "StableRef": "df3395f6b2d759aba39fb67a7bc0fe49147c8b39",
          "Prereqs": [],
          "Patches": [
            "xsa410/xsa410-4.16-*.patch"
          ]
        }
      }
    },
    "4.16": {
      "Recipes": {
        "xen": {
          "StableRef": "89fe6d0edea841d1d2690cf3f5173e334c687823",
          "Prereqs": [],
          "Patches": [
            "xsa410/xsa410-4.16-*.patch"
          ]
        }
      }
    },
    "master": {
      "Recipes": {
        "xen": {
          "StableRef": "01ca29f0b17a50a94b0e232ba276c32e95d80ae3",
          "Prereqs": [],
          "Patches": [
            "xsa410/xsa410-?.patch"
          ]
        }
      }
    }
  }
}
From e74492aea8185b6529b2eeba990840a3bec56d12 Mon Sep 17 00:00:00 2001
From: Julien Grall <jgrall@amazon.com>
Date: Mon, 6 Jun 2022 06:17:25 +0000
Subject: [PATCH 1/2] xen/arm: p2m: Prevent adding mapping when domain is dying

During the domain destroy process, the domain will still be accessible
until it is fully destroyed. The same is true for the P2M, because we
don't bail out early when is_dying is non-zero. If a domain has
permission to modify another domain's P2M (e.g. dom0, or a stubdomain),
then foreign mappings can be added past relinquish_p2m_mapping().

Therefore, we need to prevent mappings from being added while the
domain is dying. This commit does so by adding a d->is_dying check to
p2m_set_entry(). It also strengthens the check in
relinquish_p2m_mapping() to make sure that no mappings can be added to
the P2M after the P2M lock is released.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Tested-by: Henry Wang <Henry.Wang@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
---
 xen/arch/arm/p2m.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
index 8449f97fe7e4..c2e0b116c467 100644
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -1092,6 +1092,15 @@ int p2m_set_entry(struct p2m_domain *p2m,
 {
     int rc = 0;
 
+    /*
+     * Any reference taken by the P2M mappings (e.g. foreign mapping) will
+     * be dropped in relinquish_p2m_mapping(). As the P2M will still
+     * be accessible after, we need to prevent mapping to be added when the
+     * domain is dying.
+     */
+    if ( unlikely(p2m->domain->is_dying) )
+        return -ENOMEM;
+
     while ( nr )
     {
         unsigned long mask;
@@ -1634,6 +1643,8 @@ int relinquish_p2m_mapping(struct domain *d)
     unsigned int order;
     gfn_t start, end;
 
+    BUG_ON(!d->is_dying);
+    /* No mappings can be added in the P2M after the P2M lock is released. */
     p2m_write_lock(p2m);
 
     start = p2m->lowest_mapped_gfn;
-- 
2.37.1

From 5be736475ce13801bea28f2e05e121f2dd31a34a Mon Sep 17 00:00:00 2001
From: Julien Grall <jgrall@amazon.com>
Date: Mon, 6 Jun 2022 06:17:26 +0000
Subject: [PATCH 2/2] xen/arm: p2m: Handle preemption when freeing intermediate
 page tables

At the moment the P2M page tables are freed when the domain structure
is freed, without any preemption. As the P2M can be quite large,
iterating through it may take more time than is reasonable without
intermediate preemption (to run softirqs and perhaps the scheduler).

Split p2m_teardown() into two parts: one preemptible, called when
relinquishing the resources, and one non-preemptible, called when
freeing the domain structure.

As we are now freeing the P2M pages early, we also need to prevent
further allocations should someone call p2m_set_entry() past
p2m_teardown() (I wasn't able to prove this will never happen). This is
done by the domain->is_dying check added to p2m_set_entry() in the
previous patch.

Similarly, we want to make sure that no one can access the freed
pages. Therefore the root is cleared before the pages are freed.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Tested-by: Henry Wang <Henry.Wang@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
---
 xen/arch/arm/domain.c          | 10 ++++++--
 xen/arch/arm/include/asm/p2m.h | 13 ++++++++--
 xen/arch/arm/p2m.c             | 47 +++++++++++++++++++++++++++++++---
 3 files changed, 63 insertions(+), 7 deletions(-)

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index 2cd481979cf1..e5ae3e71eb20 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -797,10 +797,10 @@ fail:
 void arch_domain_destroy(struct domain *d)
 {
     /* IOMMU page table is shared with P2M, always call
-     * iommu_domain_destroy() before p2m_teardown().
+     * iommu_domain_destroy() before p2m_final_teardown().
      */
     iommu_domain_destroy(d);
-    p2m_teardown(d);
+    p2m_final_teardown(d);
     domain_vgic_free(d);
     domain_vuart_free(d);
     free_xenheap_page(d->shared_info);
@@ -1004,6 +1004,7 @@ enum {
     PROG_xen,
     PROG_page,
     PROG_mapping,
+    PROG_p2m,
     PROG_done,
 };
 
@@ -1064,6 +1065,11 @@ int domain_relinquish_resources(struct domain *d)
         if ( ret )
             return ret;
 
+    PROGRESS(p2m):
+        ret = p2m_teardown(d);
+        if ( ret )
+            return ret;
+
     PROGRESS(done):
         break;
 
diff --git a/xen/arch/arm/include/asm/p2m.h b/xen/arch/arm/include/asm/p2m.h
index 8cce459b67ba..a15ea67f9b48 100644
--- a/xen/arch/arm/include/asm/p2m.h
+++ b/xen/arch/arm/include/asm/p2m.h
@@ -192,8 +192,17 @@ void setup_virt_paging(void);
 /* Init the datastructures for later use by the p2m code */
 int p2m_init(struct domain *d);
 
-/* Return all the p2m resources to Xen. */
-void p2m_teardown(struct domain *d);
+/*
+ * The P2M resources are freed in two parts:
+ *  - p2m_teardown() will be called when relinquish the resources. It
+ *    will free large resources (e.g. intermediate page-tables) that
+ *    requires preemption.
+ *  - p2m_final_teardown() will be called when domain struct is been
+ *    freed. This *cannot* be preempted and therefore one small
+ *    resources should be freed here.
+ */
+int p2m_teardown(struct domain *d);
+void p2m_final_teardown(struct domain *d);
 
 /*
  * Remove mapping refcount on each mapping page in the p2m
diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
index c2e0b116c467..b445f4d7541e 100644
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -1551,17 +1551,58 @@ static void p2m_free_vmid(struct domain *d)
     spin_unlock(&vmid_alloc_lock);
 }
 
-void p2m_teardown(struct domain *d)
+int p2m_teardown(struct domain *d)
 {
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
+    unsigned long count = 0;
     struct page_info *pg;
+    unsigned int i;
+    int rc = 0;
+
+    p2m_write_lock(p2m);
+
+    /*
+     * We are about to free the intermediate page-tables, so clear the
+     * root to prevent any walk to use them.
+     */
+    for ( i = 0; i < P2M_ROOT_PAGES; i++ )
+        clear_and_clean_page(p2m->root + i);
+
+    /*
+     * The domain will not be scheduled anymore, so in theory we should
+     * not need to flush the TLBs. Do it for safety purpose.
+     *
+     * Note that all the devices have already been de-assigned. So we don't
+     * need to flush the IOMMU TLB here.
+     */
+    p2m_force_tlb_flush_sync(p2m);
+
+    while ( (pg = page_list_remove_head(&p2m->pages)) )
+    {
+        free_domheap_page(pg);
+        count++;
+        /* Arbitrarily preempt every 512 iterations */
+        if ( !(count % 512) && hypercall_preempt_check() )
+        {
+            rc = -ERESTART;
+            break;
+        }
+    }
+
+    p2m_write_unlock(p2m);
+
+    return rc;
+}
+
+void p2m_final_teardown(struct domain *d)
+{
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
 
     /* p2m not actually initialized */
     if ( !p2m->domain )
         return;
 
-    while ( (pg = page_list_remove_head(&p2m->pages)) )
-        free_domheap_page(pg);
+    ASSERT(page_list_empty(&p2m->pages));
 
     if ( p2m->root )
         free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
-- 
2.37.1

From: Roger Pau Monné <roger.pau@citrix.com>
Subject: x86/p2m: add option to skip root pagetable removal in p2m_teardown()

Add a new parameter to p2m_teardown() in order to select whether the
root page table should also be freed.  Note that all users are
adjusted to pass the parameter to remove the root page tables, so
behavior is not modified.

No functional change intended.

This is part of CVE-2022-33746 / XSA-410.

Suggested-by: Julien Grall <julien@xen.org>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
---
Changes since v1:
 - New in this version.

--- a/xen/arch/x86/include/asm/p2m.h
+++ b/xen/arch/x86/include/asm/p2m.h
@@ -600,7 +600,7 @@ int p2m_init(struct domain *d);
 int p2m_alloc_table(struct p2m_domain *p2m);
 
 /* Return all the p2m resources to Xen. */
-void p2m_teardown(struct p2m_domain *p2m);
+void p2m_teardown(struct p2m_domain *p2m, bool remove_root);
 void p2m_final_teardown(struct domain *d);
 
 /* Add/remove a page to/from a domain's p2m table. */
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -541,18 +541,18 @@ void hap_final_teardown(struct domain *d
         }
 
         for ( i = 0; i < MAX_ALTP2M; i++ )
-            p2m_teardown(d->arch.altp2m_p2m[i]);
+            p2m_teardown(d->arch.altp2m_p2m[i], true);
     }
 
     /* Destroy nestedp2m's first */
     for (i = 0; i < MAX_NESTEDP2M; i++) {
-        p2m_teardown(d->arch.nested_p2m[i]);
+        p2m_teardown(d->arch.nested_p2m[i], true);
     }
 
     if ( d->arch.paging.hap.total_pages != 0 )
         hap_teardown(d, NULL);
 
-    p2m_teardown(p2m_get_hostp2m(d));
+    p2m_teardown(p2m_get_hostp2m(d), true);
     /* Free any memory that the p2m teardown released */
     paging_lock(d);
     hap_set_allocation(d, 0, NULL);
--- a/xen/arch/x86/mm/p2m-basic.c
+++ b/xen/arch/x86/mm/p2m-basic.c
@@ -154,10 +154,10 @@ int p2m_init(struct domain *d)
  * hvm fixme: when adding support for pvh non-hardware domains, this path must
  * cleanup any foreign p2m types (release refcnts on them).
  */
-void p2m_teardown(struct p2m_domain *p2m)
+void p2m_teardown(struct p2m_domain *p2m, bool remove_root)
 {
 #ifdef CONFIG_HVM
-    struct page_info *pg;
+    struct page_info *pg, *root_pg = NULL;
     struct domain *d;
 
     if ( !p2m )
@@ -171,10 +171,20 @@ void p2m_teardown(struct p2m_domain *p2m
     ASSERT(atomic_read(&d->shr_pages) == 0);
 #endif
 
-    p2m->phys_table = pagetable_null();
+    if ( remove_root )
+        p2m->phys_table = pagetable_null();
+    else if ( !pagetable_is_null(p2m->phys_table) )
+    {
+        root_pg = pagetable_get_page(p2m->phys_table);
+        clear_domain_page(pagetable_get_mfn(p2m->phys_table));
+    }
 
     while ( (pg = page_list_remove_head(&p2m->pages)) )
-        d->arch.paging.free_page(d, pg);
+        if ( pg != root_pg )
+            d->arch.paging.free_page(d, pg);
+
+    if ( root_pg )
+        page_list_add(root_pg, &p2m->pages);
 
     p2m_unlock(p2m);
 #endif
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -2707,7 +2707,7 @@ int shadow_enable(struct domain *d, u32
  out_unlocked:
 #ifdef CONFIG_HVM
     if ( rv != 0 && !pagetable_is_null(p2m_get_pagetable(p2m)) )
-        p2m_teardown(p2m);
+        p2m_teardown(p2m, true);
 #endif
     if ( rv != 0 && pg != NULL )
     {
@@ -2873,7 +2873,7 @@ void shadow_final_teardown(struct domain
         shadow_teardown(d, NULL);
 
     /* It is now safe to pull down the p2m map. */
-    p2m_teardown(p2m_get_hostp2m(d));
+    p2m_teardown(p2m_get_hostp2m(d), true);
     /* Free any shadow memory that the p2m teardown released */
     paging_lock(d);
     shadow_set_allocation(d, 0, NULL);
From be1a42f496f8d51fc36c9a67ad1f32e3c77c69af Mon Sep 17 00:00:00 2001
From: Julien Grall <jgrall@amazon.com>
Date: Mon, 6 Jun 2022 06:17:25 +0000
Subject: [PATCH 1/2] xen/arm: p2m: Prevent adding mapping when domain is dying

During the domain destroy process, the domain will still be accessible
until it is fully destroyed. The same is true for the P2M, because we
don't bail out early when is_dying is non-zero. If a domain has
permission to modify another domain's P2M (e.g. dom0, or a stubdomain),
then foreign mappings can be added past relinquish_p2m_mapping().

Therefore, we need to prevent mappings from being added while the
domain is dying. This commit does so by adding a d->is_dying check to
p2m_set_entry(). It also strengthens the check in
relinquish_p2m_mapping() to make sure that no mappings can be added to
the P2M after the P2M lock is released.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Tested-by: Henry Wang <Henry.Wang@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
---
 xen/arch/arm/p2m.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
index 993fe4ded212..ff745776380b 100644
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -1089,6 +1089,15 @@ int p2m_set_entry(struct p2m_domain *p2m,
 {
     int rc = 0;
 
+    /*
+     * Any reference taken by the P2M mappings (e.g. foreign mapping) will
+     * be dropped in relinquish_p2m_mapping(). As the P2M will still
+     * be accessible after, we need to prevent mapping to be added when the
+     * domain is dying.
+     */
+    if ( unlikely(p2m->domain->is_dying) )
+        return -ENOMEM;
+
     while ( nr )
     {
         unsigned long mask;
@@ -1578,6 +1587,8 @@ int relinquish_p2m_mapping(struct domain *d)
     unsigned int order;
     gfn_t start, end;
 
+    BUG_ON(!d->is_dying);
+    /* No mappings can be added in the P2M after the P2M lock is released. */
     p2m_write_lock(p2m);
 
     start = p2m->lowest_mapped_gfn;
-- 
2.37.1

From 00e16f4b70c47c09602234ccff1dcec5ca1ecbfc Mon Sep 17 00:00:00 2001
From: Julien Grall <jgrall@amazon.com>
Date: Mon, 6 Jun 2022 06:17:26 +0000
Subject: [PATCH 2/2] xen/arm: p2m: Handle preemption when freeing intermediate
 page tables

At the moment the P2M page tables are freed when the domain structure
is freed, without any preemption. As the P2M can be quite large,
iterating through it may take more time than is reasonable without
intermediate preemption (to run softirqs and perhaps the scheduler).

Split p2m_teardown() into two parts: one preemptible, called when
relinquishing the resources, and one non-preemptible, called when
freeing the domain structure.

As we are now freeing the P2M pages early, we also need to prevent
further allocations should someone call p2m_set_entry() past
p2m_teardown() (I wasn't able to prove this will never happen). This is
done by the domain->is_dying check added to p2m_set_entry() in the
previous patch.

Similarly, we want to make sure that no one can access the freed
pages. Therefore the root is cleared before the pages are freed.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Tested-by: Henry Wang <Henry.Wang@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
---
 xen/arch/arm/domain.c        | 12 +++++++--
 xen/arch/arm/p2m.c           | 47 +++++++++++++++++++++++++++++++++---
 xen/include/asm-arm/domain.h |  1 +
 xen/include/asm-arm/p2m.h    | 13 ++++++++--
 4 files changed, 66 insertions(+), 7 deletions(-)

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index ddeccb992cf6..1e24a7dbb4a1 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -775,10 +775,10 @@ fail:
 void arch_domain_destroy(struct domain *d)
 {
     /* IOMMU page table is shared with P2M, always call
-     * iommu_domain_destroy() before p2m_teardown().
+     * iommu_domain_destroy() before p2m_final_teardown().
      */
     iommu_domain_destroy(d);
-    p2m_teardown(d);
+    p2m_final_teardown(d);
     domain_vgic_free(d);
     domain_vuart_free(d);
     free_xenheap_page(d->shared_info);
@@ -1014,6 +1014,14 @@ int domain_relinquish_resources(struct domain *d)
         if ( ret )
             return ret;
 
+        d->arch.relmem = RELMEM_p2m;
+        /* Fallthrough */
+
+    case RELMEM_p2m:
+        ret = p2m_teardown(d);
+        if ( ret )
+            return ret;
+
         d->arch.relmem = RELMEM_done;
         /* Fallthrough */
 
diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
index ff745776380b..42638787a295 100644
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -1495,17 +1495,58 @@ static void p2m_free_vmid(struct domain *d)
     spin_unlock(&vmid_alloc_lock);
 }
 
-void p2m_teardown(struct domain *d)
+int p2m_teardown(struct domain *d)
 {
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
+    unsigned long count = 0;
     struct page_info *pg;
+    unsigned int i;
+    int rc = 0;
+
+    p2m_write_lock(p2m);
+
+    /*
+     * We are about to free the intermediate page-tables, so clear the
+     * root to prevent any walk to use them.
+     */
+    for ( i = 0; i < P2M_ROOT_PAGES; i++ )
+        clear_and_clean_page(p2m->root + i);
+
+    /*
+     * The domain will not be scheduled anymore, so in theory we should
+     * not need to flush the TLBs. Do it for safety purpose.
+     *
+     * Note that all the devices have already been de-assigned. So we don't
+     * need to flush the IOMMU TLB here.
+     */
+    p2m_force_tlb_flush_sync(p2m);
+
+    while ( (pg = page_list_remove_head(&p2m->pages)) )
+    {
+        free_domheap_page(pg);
+        count++;
+        /* Arbitrarily preempt every 512 iterations */
+        if ( !(count % 512) && hypercall_preempt_check() )
+        {
+            rc = -ERESTART;
+            break;
+        }
+    }
+
+    p2m_write_unlock(p2m);
+
+    return rc;
+}
+
+void p2m_final_teardown(struct domain *d)
+{
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
 
     /* p2m not actually initialized */
     if ( !p2m->domain )
         return;
 
-    while ( (pg = page_list_remove_head(&p2m->pages)) )
-        free_domheap_page(pg);
+    ASSERT(page_list_empty(&p2m->pages));
 
     if ( p2m->root )
         free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
diff --git a/xen/include/asm-arm/domain.h b/xen/include/asm-arm/domain.h
index f1776c6c0899..9b44a9648c50 100644
--- a/xen/include/asm-arm/domain.h
+++ b/xen/include/asm-arm/domain.h
@@ -62,6 +62,7 @@ struct arch_domain
         RELMEM_xen,
         RELMEM_page,
         RELMEM_mapping,
+        RELMEM_p2m,
         RELMEM_done,
     } relmem;
 
diff --git a/xen/include/asm-arm/p2m.h b/xen/include/asm-arm/p2m.h
index 5fdb6e818348..20df6212712e 100644
--- a/xen/include/asm-arm/p2m.h
+++ b/xen/include/asm-arm/p2m.h
@@ -171,8 +171,17 @@ void setup_virt_paging(void);
 /* Init the datastructures for later use by the p2m code */
 int p2m_init(struct domain *d);
 
-/* Return all the p2m resources to Xen. */
-void p2m_teardown(struct domain *d);
+/*
+ * The P2M resources are freed in two parts:
+ *  - p2m_teardown() will be called when relinquish the resources. It
+ *    will free large resources (e.g. intermediate page-tables) that
+ *    requires preemption.
+ *  - p2m_final_teardown() will be called when domain struct is been
+ *    freed. This *cannot* be preempted and therefore one small
+ *    resources should be freed here.
+ */
+int p2m_teardown(struct domain *d);
+void p2m_final_teardown(struct domain *d);
 
 /*
  * Remove mapping refcount on each mapping page in the p2m
-- 
2.37.1

From: Roger Pau Monné <roger.pau@citrix.com>
Subject: x86/p2m: add option to skip root pagetable removal in p2m_teardown()

Add a new parameter to p2m_teardown() in order to select whether the
root page table should also be freed.  Note that all users are
adjusted to pass the parameter to remove the root page tables, so
behavior is not modified.

No functional change intended.

This is part of CVE-2022-33746 / XSA-410.

Suggested-by: Julien Grall <julien@xen.org>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -599,7 +599,7 @@ int p2m_init(struct domain *d);
 int p2m_alloc_table(struct p2m_domain *p2m);
 
 /* Return all the p2m resources to Xen. */
-void p2m_teardown(struct p2m_domain *p2m);
+void p2m_teardown(struct p2m_domain *p2m, bool remove_root);
 void p2m_final_teardown(struct domain *d);
 
 /* Add a page to a domain's p2m table */
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -521,18 +521,18 @@ void hap_final_teardown(struct domain *d
         }
 
         for ( i = 0; i < MAX_ALTP2M; i++ )
-            p2m_teardown(d->arch.altp2m_p2m[i]);
+            p2m_teardown(d->arch.altp2m_p2m[i], true);
     }
 
     /* Destroy nestedp2m's first */
     for (i = 0; i < MAX_NESTEDP2M; i++) {
-        p2m_teardown(d->arch.nested_p2m[i]);
+        p2m_teardown(d->arch.nested_p2m[i], true);
     }
 
     if ( d->arch.paging.hap.total_pages != 0 )
         hap_teardown(d, NULL);
 
-    p2m_teardown(p2m_get_hostp2m(d));
+    p2m_teardown(p2m_get_hostp2m(d), true);
     /* Free any memory that the p2m teardown released */
     paging_lock(d);
     hap_set_allocation(d, 0, NULL);
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -737,11 +737,11 @@ int p2m_alloc_table(struct p2m_domain *p
  * hvm fixme: when adding support for pvh non-hardware domains, this path must
  * cleanup any foreign p2m types (release refcnts on them).
  */
-void p2m_teardown(struct p2m_domain *p2m)
+void p2m_teardown(struct p2m_domain *p2m, bool remove_root)
 /* Return all the p2m pages to Xen.
  * We know we don't have any extra mappings to these pages */
 {
-    struct page_info *pg;
+    struct page_info *pg, *root_pg = NULL;
     struct domain *d;
 
     if (p2m == NULL)
@@ -751,10 +751,22 @@ void p2m_teardown(struct p2m_domain *p2m
 
     p2m_lock(p2m);
     ASSERT(atomic_read(&d->shr_pages) == 0);
-    p2m->phys_table = pagetable_null();
+
+    if ( remove_root )
+        p2m->phys_table = pagetable_null();
+    else if ( !pagetable_is_null(p2m->phys_table) )
+    {
+        root_pg = pagetable_get_page(p2m->phys_table);
+        clear_domain_page(pagetable_get_mfn(p2m->phys_table));
+    }
 
     while ( (pg = page_list_remove_head(&p2m->pages)) )
-        d->arch.paging.free_page(d, pg);
+        if ( pg != root_pg )
+            d->arch.paging.free_page(d, pg);
+
+    if ( root_pg )
+        page_list_add(root_pg, &p2m->pages);
+
     p2m_unlock(p2m);
 }
 
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -2684,7 +2684,7 @@ int shadow_enable(struct domain *d, u32
     paging_unlock(d);
  out_unlocked:
     if ( rv != 0 && !pagetable_is_null(p2m_get_pagetable(p2m)) )
-        p2m_teardown(p2m);
+        p2m_teardown(p2m, true);
     if ( rv != 0 && pg != NULL )
     {
         pg->count_info &= ~PGC_count_mask;
@@ -2835,7 +2835,7 @@ void shadow_final_teardown(struct domain
         shadow_teardown(d, NULL);
 
     /* It is now safe to pull down the p2m map. */
-    p2m_teardown(p2m_get_hostp2m(d));
+    p2m_teardown(p2m_get_hostp2m(d), true);
     /* Free any shadow memory that the p2m teardown released */
     paging_lock(d);
     shadow_set_allocation(d, 0, NULL);
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/HAP: adjust monitor table related error handling

hap_make_monitor_table() will return INVALID_MFN if it encounters an
error condition, but hap_update_paging_modes() wasn’t handling this
value, resulting in an inappropriate value being stored in
monitor_table. This would subsequently misguide at least
hap_vcpu_teardown(). Avoid this by bailing early.

Further, when a domain has already been crashed or (perhaps less
importantly, as there's no such path known to lead here) is already
dying, avoid calling domain_crash() on it again - that's at best
confusing.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -39,6 +39,7 @@
 #include <asm/domain.h>
 #include <xen/numa.h>
 #include <asm/hvm/nestedhvm.h>
+#include <public/sched.h>
 
 #include "private.h"
 
@@ -405,8 +406,13 @@ static mfn_t hap_make_monitor_table(stru
     return m4mfn;
 
  oom:
-    printk(XENLOG_G_ERR "out of memory building monitor pagetable\n");
-    domain_crash(d);
+    if ( !d->is_dying &&
+         (!d->is_shutting_down || d->shutdown_code != SHUTDOWN_crash) )
+    {
+        printk(XENLOG_G_ERR "%pd: out of memory building monitor pagetable\n",
+               d);
+        domain_crash(d);
+    }
     return INVALID_MFN;
 }
 
@@ -693,6 +699,9 @@ static void hap_update_paging_modes(stru
     if ( pagetable_is_null(v->arch.monitor_table) )
     {
         mfn_t mmfn = hap_make_monitor_table(v);
+
+        if ( mfn_eq(mmfn, INVALID_MFN) )
+            goto unlock;
         v->arch.monitor_table = pagetable_from_mfn(mmfn);
         make_cr3(v, mmfn);
         hvm_update_host_cr3(v);
@@ -701,6 +710,7 @@ static void hap_update_paging_modes(stru
     /* CR3 is effectively updated by a mode change. Flush ASIDs, etc. */
     hap_update_cr3(v, 0, false);
 
+ unlock:
     paging_unlock(d);
     put_gfn(d, cr3_gfn);
 }
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/shadow: tolerate failure of sh_set_toplevel_shadow()

Subsequently sh_set_toplevel_shadow() will be adjusted to install a
blank entry in case prealloc fails. There are, in fact, pre-existing
error paths which would put in place a blank entry. The 4- and 2-level
code in sh_update_cr3(), however, assume the top level entry to be
valid.

Hence bail from the function in the unlikely event that it's not. Note
that 3-level logic works differently: In particular a guest is free to
supply a PDPTR pointing at 4 non-present (or otherwise deemed invalid)
entries. The guest will crash, but we already cope with that.

Really mfn_valid() is likely wrong to use in sh_set_toplevel_shadow(),
and it should instead be !mfn_eq(gmfn, INVALID_MFN). Avoid such a change
in security context, but add a respective assertion.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -3861,6 +3861,7 @@ sh_set_toplevel_shadow(struct vcpu *v,
     /* Now figure out the new contents: is this a valid guest MFN? */
     if ( !mfn_valid(gmfn) )
     {
+        ASSERT(mfn_eq(gmfn, INVALID_MFN));
         new_entry = pagetable_null();
         goto install_new_entry;
     }
@@ -4014,6 +4015,11 @@ sh_update_cr3(struct vcpu *v, int do_loc
     if ( sh_remove_write_access(d, gmfn, 2, 0) != 0 )
         flush_tlb_mask(d->dirty_cpumask);
     sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l2_shadow);
+    if ( unlikely(pagetable_is_null(v->arch.shadow_table[0])) )
+    {
+        ASSERT(d->is_dying || d->is_shutting_down);
+        return;
+    }
 #elif GUEST_PAGING_LEVELS == 3
     /* PAE guests have four shadow_table entries, based on the
      * current values of the guest's four l3es. */
@@ -4059,6 +4065,11 @@ sh_update_cr3(struct vcpu *v, int do_loc
     if ( sh_remove_write_access(d, gmfn, 4, 0) != 0 )
         flush_tlb_mask(d->dirty_cpumask);
     sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l4_shadow);
+    if ( unlikely(pagetable_is_null(v->arch.shadow_table[0])) )
+    {
+        ASSERT(d->is_dying || d->is_shutting_down);
+        return;
+    }
     if ( !shadow_mode_external(d) && !is_pv_32bit_domain(d) )
     {
         mfn_t smfn = pagetable_get_mfn(v->arch.shadow_table[0]);
From: Roger Pau Monné <roger.pau@citrix.com>
Subject: x86/shadow: tolerate failure in shadow_prealloc()

Prevent _shadow_prealloc() from calling BUG() when unable to fulfill
the pre-allocation and instead return true/false.  Modify
shadow_prealloc() to crash the domain on allocation failure (if the
domain is not already dying), as shadow cannot operate normally after
that.  Modify callers to also gracefully handle {_,}shadow_prealloc()
failing to fulfill the request.

Note this in turn requires adjusting the callers of
sh_make_monitor_table() also to handle it returning INVALID_MFN.
sh_update_paging_modes() is also modified to add additional error
paths in case of allocation failure; some of those will return with
null monitor page tables (and the domain likely crashed).  This is no
different from the current error paths, but the newly introduced ones
are more likely to trigger.

The newly added failure points in sh_update_paging_modes() also
require that on some error return paths the previous structures are
cleared, and thus the monitor table is null.

While there, adjust the 'type' parameter of shadow_prealloc() to
unsigned int rather than u32.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -36,6 +36,7 @@
 #include <asm/shadow.h>
 #include <asm/hvm/ioreq.h>
 #include <xen/numa.h>
+#include <public/sched.h>
 #include "private.h"
 
 DEFINE_PER_CPU(uint32_t,trace_shadow_path_flags);
@@ -896,14 +897,15 @@ static inline void trace_shadow_prealloc
 
 /* Make sure there are at least count order-sized pages
  * available in the shadow page pool. */
-static void _shadow_prealloc(struct domain *d, unsigned int pages)
+static bool __must_check _shadow_prealloc(struct domain *d, unsigned int pages)
 {
     struct vcpu *v;
     struct page_info *sp, *t;
     mfn_t smfn;
     int i;
 
-    if ( d->arch.paging.shadow.free_pages >= pages ) return;
+    if ( d->arch.paging.shadow.free_pages >= pages )
+        return true;
 
     /* Shouldn't have enabled shadows if we've no vcpus. */
     ASSERT(d->vcpu && d->vcpu[0]);
@@ -919,7 +921,8 @@ static void _shadow_prealloc(struct doma
         sh_unpin(d, smfn);
 
         /* See if that freed up enough space */
-        if ( d->arch.paging.shadow.free_pages >= pages ) return;
+        if ( d->arch.paging.shadow.free_pages >= pages )
+            return true;
     }
 
     /* Stage two: all shadow pages are in use in hierarchies that are
@@ -940,7 +943,7 @@ static void _shadow_prealloc(struct doma
                 if ( d->arch.paging.shadow.free_pages >= pages )
                 {
                     flush_tlb_mask(d->dirty_cpumask);
-                    return;
+                    return true;
                 }
             }
         }
@@ -953,7 +956,12 @@ static void _shadow_prealloc(struct doma
            d->arch.paging.shadow.total_pages,
            d->arch.paging.shadow.free_pages,
            d->arch.paging.shadow.p2m_pages);
-    BUG();
+
+    ASSERT(d->is_dying);
+
+    flush_tlb_mask(d->dirty_cpumask);
+
+    return false;
 }
 
 /* Make sure there are at least count pages of the order according to
@@ -961,9 +969,19 @@ static void _shadow_prealloc(struct doma
  * This must be called before any calls to shadow_alloc().  Since this
  * will free existing shadows to make room, it must be called early enough
  * to avoid freeing shadows that the caller is currently working on. */
-void shadow_prealloc(struct domain *d, u32 type, unsigned int count)
+bool shadow_prealloc(struct domain *d, unsigned int type, unsigned int count)
 {
-    return _shadow_prealloc(d, shadow_size(type) * count);
+    bool ret = _shadow_prealloc(d, shadow_size(type) * count);
+
+    if ( !ret && !d->is_dying &&
+         (!d->is_shutting_down || d->shutdown_code != SHUTDOWN_crash) )
+        /*
+         * Failing to allocate memory required for shadow usage can only result in
+         * a domain crash, do it here rather that relying on every caller to do it.
+         */
+        domain_crash(d);
+
+    return ret;
 }
 
 /* Deliberately free all the memory we can: this will tear down all of
@@ -1186,7 +1204,7 @@ void shadow_free(struct domain *d, mfn_t
 static struct page_info *
 shadow_alloc_p2m_page(struct domain *d)
 {
-    struct page_info *pg;
+    struct page_info *pg = NULL;
 
     /* This is called both from the p2m code (which never holds the
      * paging lock) and the log-dirty code (which always does). */
@@ -1204,16 +1222,18 @@ shadow_alloc_p2m_page(struct domain *d)
                     d->arch.paging.shadow.p2m_pages,
                     shadow_min_acceptable_pages(d));
         }
-        paging_unlock(d);
-        return NULL;
+        goto out;
     }
 
-    shadow_prealloc(d, SH_type_p2m_table, 1);
+    if ( !shadow_prealloc(d, SH_type_p2m_table, 1) )
+        goto out;
+
     pg = mfn_to_page(shadow_alloc(d, SH_type_p2m_table, 0));
     d->arch.paging.shadow.p2m_pages++;
     d->arch.paging.shadow.total_pages--;
     ASSERT(!page_get_owner(pg) && !(pg->count_info & PGC_count_mask));
 
+ out:
     paging_unlock(d);
 
     return pg;
@@ -1304,7 +1324,9 @@ int shadow_set_allocation(struct domain
         else if ( d->arch.paging.shadow.total_pages > pages )
         {
             /* Need to return memory to domheap */
-            _shadow_prealloc(d, 1);
+            if ( !_shadow_prealloc(d, 1) )
+                return -ENOMEM;
+
             sp = page_list_remove_head(&d->arch.paging.shadow.freelist);
             ASSERT(sp);
             /*
@@ -2396,12 +2418,13 @@ static void sh_update_paging_modes(struc
     if ( mfn_eq(v->arch.paging.shadow.oos_snapshot[0], INVALID_MFN) )
     {
         int i;
+
+        if ( !shadow_prealloc(d, SH_type_oos_snapshot, SHADOW_OOS_PAGES) )
+            return;
+
         for(i = 0; i < SHADOW_OOS_PAGES; i++)
-        {
-            shadow_prealloc(d, SH_type_oos_snapshot, 1);
             v->arch.paging.shadow.oos_snapshot[i] =
                 shadow_alloc(d, SH_type_oos_snapshot, 0);
-        }
     }
 #endif /* OOS */
 
@@ -2463,6 +2486,10 @@ static void sh_update_paging_modes(struc
         if ( pagetable_is_null(v->arch.monitor_table) )
         {
             mfn_t mmfn = v->arch.paging.mode->shadow.make_monitor_table(v);
+
+            if ( mfn_eq(mmfn, INVALID_MFN) )
+                return;
+
             v->arch.monitor_table = pagetable_from_mfn(mmfn);
             make_cr3(v, mmfn);
             hvm_update_host_cr3(v);
@@ -2500,6 +2527,11 @@ static void sh_update_paging_modes(struc
                 old_mfn = pagetable_get_mfn(v->arch.monitor_table);
                 v->arch.monitor_table = pagetable_null();
                 new_mfn = v->arch.paging.mode->shadow.make_monitor_table(v);
+                if ( mfn_eq(new_mfn, INVALID_MFN) )
+                {
+                    old_mode->shadow.destroy_monitor_table(v, old_mfn);
+                    return;
+                }
                 v->arch.monitor_table = pagetable_from_mfn(new_mfn);
                 SHADOW_PRINTK("new monitor table %"PRI_mfn "\n",
                                mfn_x(new_mfn));
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -1524,7 +1524,8 @@ sh_make_monitor_table(struct vcpu *v)
     ASSERT(pagetable_get_pfn(v->arch.monitor_table) == 0);
 
     /* Guarantee we can get the memory we need */
-    shadow_prealloc(d, SH_type_monitor_table, CONFIG_PAGING_LEVELS);
+    if ( !shadow_prealloc(d, SH_type_monitor_table, CONFIG_PAGING_LEVELS) )
+        return INVALID_MFN;
 
     {
         mfn_t m4mfn;
@@ -3052,9 +3053,14 @@ static int sh_page_fault(struct vcpu *v,
      * Preallocate shadow pages *before* removing writable accesses
      * otherwhise an OOS L1 might be demoted and promoted again with
      * writable mappings. */
-    shadow_prealloc(d,
-                    SH_type_l1_shadow,
-                    GUEST_PAGING_LEVELS < 4 ? 1 : GUEST_PAGING_LEVELS - 1);
+    if ( !shadow_prealloc(d, SH_type_l1_shadow,
+                          GUEST_PAGING_LEVELS < 4
+                          ? 1 : GUEST_PAGING_LEVELS - 1) )
+    {
+        paging_unlock(d);
+        put_gfn(d, gfn_x(gfn));
+        return 0;
+    }
 
     rc = gw_remove_write_accesses(v, va, &gw);
 
@@ -3871,7 +3877,12 @@ sh_set_toplevel_shadow(struct vcpu *v,
     if ( !mfn_valid(smfn) )
     {
         /* Make sure there's enough free shadow memory. */
-        shadow_prealloc(d, root_type, 1);
+        if ( !shadow_prealloc(d, root_type, 1) )
+        {
+            new_entry = pagetable_null();
+            goto install_new_entry;
+        }
+
         /* Shadow the page. */
         smfn = sh_make_shadow(v, gmfn, root_type);
     }
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -347,7 +347,8 @@ void shadow_promote(struct domain *d, mf
 void shadow_demote(struct domain *d, mfn_t gmfn, u32 type);
 
 /* Shadow page allocation functions */
-void  shadow_prealloc(struct domain *d, u32 shadow_type, unsigned int count);
+bool __must_check shadow_prealloc(struct domain *d, unsigned int shadow_type,
+                                  unsigned int count);
 mfn_t shadow_alloc(struct domain *d,
                     u32 shadow_type,
                     unsigned long backpointer);
From: Roger Pau Monné <roger.pau@citrix.com>
Subject: x86/p2m: refuse new allocations for dying domains

This will in particular prevent any attempts to add entries to the p2m,
once - in a subsequent change - non-root entries have been removed.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -244,6 +244,9 @@ static struct page_info *hap_alloc(struc
 
     ASSERT(paging_locked_by_me(d));
 
+    if ( unlikely(d->is_dying) )
+        return NULL;
+
     pg = page_list_remove_head(&d->arch.paging.hap.freelist);
     if ( unlikely(!pg) )
         return NULL;
@@ -280,7 +283,7 @@ static struct page_info *hap_alloc_p2m_p
         d->arch.paging.hap.p2m_pages++;
         ASSERT(!page_get_owner(pg) && !(pg->count_info & PGC_count_mask));
     }
-    else if ( !d->arch.paging.p2m_alloc_failed )
+    else if ( !d->arch.paging.p2m_alloc_failed && !d->is_dying )
     {
         d->arch.paging.p2m_alloc_failed = 1;
         dprintk(XENLOG_ERR, "d%i failed to allocate from HAP pool\n",
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -907,6 +907,10 @@ static bool __must_check _shadow_preallo
     if ( d->arch.paging.shadow.free_pages >= pages )
         return true;
 
+    if ( unlikely(d->is_dying) )
+        /* No reclaim when the domain is dying, teardown will take care of it. */
+        return false;
+
     /* Shouldn't have enabled shadows if we've no vcpus. */
     ASSERT(d->vcpu && d->vcpu[0]);
 
@@ -957,7 +961,7 @@ static bool __must_check _shadow_preallo
            d->arch.paging.shadow.free_pages,
            d->arch.paging.shadow.p2m_pages);
 
-    ASSERT(d->is_dying);
+    ASSERT_UNREACHABLE();
 
     flush_tlb_mask(d->dirty_cpumask);
 
@@ -971,10 +975,13 @@ static bool __must_check _shadow_preallo
  * to avoid freeing shadows that the caller is currently working on. */
 bool shadow_prealloc(struct domain *d, unsigned int type, unsigned int count)
 {
-    bool ret = _shadow_prealloc(d, shadow_size(type) * count);
+    bool ret;
+
+    if ( unlikely(d->is_dying) )
+       return false;
 
-    if ( !ret && !d->is_dying &&
-         (!d->is_shutting_down || d->shutdown_code != SHUTDOWN_crash) )
+    ret = _shadow_prealloc(d, shadow_size(type) * count);
+    if ( !ret && (!d->is_shutting_down || d->shutdown_code != SHUTDOWN_crash) )
         /*
          * Failing to allocate memory required for shadow usage can only result in
          * a domain crash, do it here rather that relying on every caller to do it.
@@ -1206,6 +1213,9 @@ shadow_alloc_p2m_page(struct domain *d)
 {
     struct page_info *pg = NULL;
 
+    if ( unlikely(d->is_dying) )
+       return NULL;
+
     /* This is called both from the p2m code (which never holds the
      * paging lock) and the log-dirty code (which always does). */
     paging_lock_recursive(d);
From: Roger Pau Monné <roger.pau@citrix.com>
Subject: x86/p2m: truly free paging pool memory for dying domains

Modify {hap,shadow}_free to free the page immediately if the domain is
dying, so that pages don't accumulate in the pool when
{shadow,hap}_final_teardown() get called. This is to limit the amount of
work which needs to be done there (in a non-preemptable manner).

Note the call to shadow_free() in shadow_free_p2m_page() is moved after
increasing total_pages, so that the decrease done in shadow_free() in
case the domain is dying doesn't underflow the counter, even if just for
a short interval.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -264,6 +264,18 @@ static void hap_free(struct domain *d, m
 
     ASSERT(paging_locked_by_me(d));
 
+    /*
+     * For dying domains, actually free the memory here. This way less work is
+     * left to hap_final_teardown(), which cannot easily have preemption checks
+     * added.
+     */
+    if ( unlikely(d->is_dying) )
+    {
+        free_domheap_page(pg);
+        d->arch.paging.hap.total_pages--;
+        return;
+    }
+
     d->arch.paging.hap.free_pages++;
     page_list_add_tail(pg, &d->arch.paging.hap.freelist);
 }
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1155,6 +1155,7 @@ mfn_t shadow_alloc(struct domain *d,
 void shadow_free(struct domain *d, mfn_t smfn)
 {
     struct page_info *next = NULL, *sp = mfn_to_page(smfn);
+    bool dying = ACCESS_ONCE(d->is_dying);
     struct page_list_head *pin_list;
     unsigned int pages;
     u32 shadow_type;
@@ -1197,11 +1198,32 @@ void shadow_free(struct domain *d, mfn_t
          * just before the allocator hands the page out again. */
         page_set_tlbflush_timestamp(sp);
         perfc_decr(shadow_alloc_count);
-        page_list_add_tail(sp, &d->arch.paging.shadow.freelist);
+
+        /*
+         * For dying domains, actually free the memory here. This way less
+         * work is left to shadow_final_teardown(), which cannot easily have
+         * preemption checks added.
+         */
+        if ( unlikely(dying) )
+        {
+            /*
+             * The backpointer field (sh.back) used by shadow code aliases the
+             * domain owner field, unconditionally clear it here to avoid
+             * free_domheap_page() attempting to parse it.
+             */
+            page_set_owner(sp, NULL);
+            free_domheap_page(sp);
+        }
+        else
+            page_list_add_tail(sp, &d->arch.paging.shadow.freelist);
+
         sp = next;
     }
 
-    d->arch.paging.shadow.free_pages += pages;
+    if ( unlikely(dying) )
+        d->arch.paging.shadow.total_pages -= pages;
+    else
+        d->arch.paging.shadow.free_pages += pages;
 }
 
 /* Divert a page from the pool to be used by the p2m mapping.
@@ -1271,9 +1293,9 @@ shadow_free_p2m_page(struct domain *d, s
      * paging lock) and the log-dirty code (which always does). */
     paging_lock_recursive(d);
 
-    shadow_free(d, page_to_mfn(pg));
     d->arch.paging.shadow.p2m_pages--;
     d->arch.paging.shadow.total_pages++;
+    shadow_free(d, page_to_mfn(pg));
 
     paging_unlock(d);
 }
From: Roger Pau Monné <roger.pau@citrix.com>
Subject: x86/p2m: free the paging memory pool preemptively

The paging memory pool is currently freed in two different places:
from {shadow,hap}_teardown() via domain_relinquish_resources() and
from {shadow,hap}_final_teardown() via complete_domain_destroy().
While the former does handle preemption, the latter doesn't.

Attempt to move as much p2m-related freeing as possible to happen
before the call to {shadow,hap}_teardown(), so that most memory can be
freed in a preemptible way.  In order to avoid causing issues to
existing callers, leave the root p2m page tables set and free them in
{hap,shadow}_final_teardown().  Also modify {hap,shadow}_free to free
the page immediately if the domain is dying, so that pages don't
accumulate in the pool when {shadow,hap}_final_teardown() get called.

Move altp2m_vcpu_disable_ve() to be done in hap_teardown(), as that's
the place where altp2m_active gets disabled now.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -38,7 +38,6 @@
 #include <xen/livepatch.h>
 #include <public/sysctl.h>
 #include <public/hvm/hvm_vcpu.h>
-#include <asm/altp2m.h>
 #include <asm/regs.h>
 #include <asm/mc146818rtc.h>
 #include <asm/system.h>
@@ -2098,12 +2097,6 @@ int domain_relinquish_resources(struct d
             vpmu_destroy(v);
         }
 
-        if ( altp2m_active(d) )
-        {
-            for_each_vcpu ( d, v )
-                altp2m_vcpu_disable_ve(v);
-        }
-
         if ( is_pv_domain(d) )
         {
             for_each_vcpu ( d, v )
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -28,6 +28,7 @@
 #include <xen/domain_page.h>
 #include <xen/guest_access.h>
 #include <xen/keyhandler.h>
+#include <asm/altp2m.h>
 #include <asm/event.h>
 #include <asm/page.h>
 #include <asm/current.h>
@@ -532,18 +533,8 @@ void hap_final_teardown(struct domain *d
     unsigned int i;
 
     if ( hvm_altp2m_supported() )
-    {
-        d->arch.altp2m_active = 0;
-
-        if ( d->arch.altp2m_eptp )
-        {
-            free_xenheap_page(d->arch.altp2m_eptp);
-            d->arch.altp2m_eptp = NULL;
-        }
-
         for ( i = 0; i < MAX_ALTP2M; i++ )
             p2m_teardown(d->arch.altp2m_p2m[i], true);
-    }
 
     /* Destroy nestedp2m's first */
     for (i = 0; i < MAX_NESTEDP2M; i++) {
@@ -558,6 +549,8 @@ void hap_final_teardown(struct domain *d
     paging_lock(d);
     hap_set_allocation(d, 0, NULL);
     ASSERT(d->arch.paging.hap.p2m_pages == 0);
+    ASSERT(d->arch.paging.hap.free_pages == 0);
+    ASSERT(d->arch.paging.hap.total_pages == 0);
     paging_unlock(d);
 }
 
@@ -565,6 +558,7 @@ void hap_teardown(struct domain *d, bool
 {
     struct vcpu *v;
     mfn_t mfn;
+    unsigned int i;
 
     ASSERT(d->is_dying);
     ASSERT(d != current->domain);
@@ -586,6 +580,31 @@ void hap_teardown(struct domain *d, bool
         }
     }
 
+    paging_unlock(d);
+
+    /* Leave the root pt in case we get further attempts to modify the p2m. */
+    if ( hvm_altp2m_supported() )
+    {
+        if ( altp2m_active(d) )
+            for_each_vcpu ( d, v )
+                altp2m_vcpu_disable_ve(v);
+
+        d->arch.altp2m_active = 0;
+
+        FREE_XENHEAP_PAGE(d->arch.altp2m_eptp);
+
+        for ( i = 0; i < MAX_ALTP2M; i++ )
+            p2m_teardown(d->arch.altp2m_p2m[i], false);
+    }
+
+    /* Destroy nestedp2m's after altp2m. */
+    for ( i = 0; i < MAX_NESTEDP2M; i++ )
+        p2m_teardown(d->arch.nested_p2m[i], false);
+
+    p2m_teardown(p2m_get_hostp2m(d), false);
+
+    paging_lock(d);
+
     if ( d->arch.paging.hap.total_pages != 0 )
     {
         hap_set_allocation(d, 0, preempted);
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -2791,6 +2791,19 @@ void shadow_teardown(struct domain *d, b
         }
     }
 
+    paging_unlock(d);
+
+    p2m_teardown(p2m_get_hostp2m(d), false);
+
+    paging_lock(d);
+
+    /*
+     * Reclaim all shadow memory so that shadow_set_allocation() doesn't find
+     * in-use pages, as _shadow_prealloc() will no longer try to reclaim pages
+     * because the domain is dying.
+     */
+    shadow_blow_tables(d);
+
 #if (SHADOW_OPTIMIZATIONS & (SHOPT_VIRTUAL_TLB|SHOPT_OUT_OF_SYNC))
     /* Free the virtual-TLB array attached to each vcpu */
     for_each_vcpu(d, v)
@@ -2909,6 +2922,9 @@ void shadow_final_teardown(struct domain
                    d->arch.paging.shadow.total_pages,
                    d->arch.paging.shadow.free_pages,
                    d->arch.paging.shadow.p2m_pages);
+    ASSERT(!d->arch.paging.shadow.total_pages);
+    ASSERT(!d->arch.paging.shadow.free_pages);
+    ASSERT(!d->arch.paging.shadow.p2m_pages);
     paging_unlock(d);
 }
 
From: Julien Grall <jgrall@amazon.com>
Subject: xen/x86: p2m: Add preemption in p2m_teardown()

The list p2m->pages contains all the pages used by the P2M. On large
instances this can be quite long, and the time spent calling
d->arch.paging.free_page() can exceed 1ms for an 80GB guest on a Xen
running in a nested environment on a c5.metal.

By extrapolation, it would take > 100ms for an 8TB guest (which we
currently security-support). So add some preemption in p2m_teardown()
and propagate it to the callers. Note there are 3 places where
preemption is not enabled:
    - hap_final_teardown()/shadow_final_teardown(): We are preventing
      updates to the P2M once the domain is dying (so no more pages
      can be allocated), and most of the P2M pages will be freed in a
      preemptible manner when relinquishing the resources. So it is
      fine to disable preemption there.
    - shadow_enable(): This is fine because it only undoes the
      allocation that may have been made by p2m_alloc_table() (so only
      the root page table).

The preemption is arbitrarily checked every 1024 iterations.

Note that with the current approach, Xen doesn't keep track of whether
the alt/nested P2Ms have been cleared, so there is some redundant work.
However, this is not expected to incur too much overhead (the P2M lock
shouldn't be contended during teardown), so this optimization is left
outside of the security fix.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
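
As a rough, standalone illustration of the pattern this patch adds (this is
not Xen code; every identifier below is a stand-in invented for the sketch),
freeing a long list with a periodic preemption check and an out-parameter
could look like this:

/*
 * Minimal sketch, assuming hypothetical stand-ins for Xen's per-domain
 * free_page() hook and for general_preempt_check().
 */
#include <stdbool.h>
#include <stddef.h>

struct page { struct page *next; };

static void free_page_stub(struct page *pg) { (void)pg; } /* stand-in */
static bool preempt_check_stub(void) { return false; }    /* stand-in */

static void teardown_pages(struct page **head, bool *preempted)
{
    unsigned int i = 0;
    struct page *pg;

    while ( (pg = *head) != NULL )
    {
        *head = pg->next;
        free_page_stub(pg);

        /* Arbitrarily check for preemption every 1024 iterations. */
        if ( preempted && !(++i % 1024) && preempt_check_stub() )
        {
            *preempted = true; /* caller bails out and retries later */
            break;
        }
    }
}

Callers passing a non-NULL pointer simply return once it gets set and are
re-invoked from the preemptible relinquish path, which is what the hunks
below do for hap_teardown() and shadow_teardown().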

--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -599,7 +599,7 @@ int p2m_init(struct domain *d);
 int p2m_alloc_table(struct p2m_domain *p2m);
 
 /* Return all the p2m resources to Xen. */
-void p2m_teardown(struct p2m_domain *p2m, bool remove_root);
+void p2m_teardown(struct p2m_domain *p2m, bool remove_root, bool *preempted);
 void p2m_final_teardown(struct domain *d);
 
 /* Add a page to a domain's p2m table */
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -534,17 +534,17 @@ void hap_final_teardown(struct domain *d
 
     if ( hvm_altp2m_supported() )
         for ( i = 0; i < MAX_ALTP2M; i++ )
-            p2m_teardown(d->arch.altp2m_p2m[i], true);
+            p2m_teardown(d->arch.altp2m_p2m[i], true, NULL);
 
     /* Destroy nestedp2m's first */
     for (i = 0; i < MAX_NESTEDP2M; i++) {
-        p2m_teardown(d->arch.nested_p2m[i], true);
+        p2m_teardown(d->arch.nested_p2m[i], true, NULL);
     }
 
     if ( d->arch.paging.hap.total_pages != 0 )
         hap_teardown(d, NULL);
 
-    p2m_teardown(p2m_get_hostp2m(d), true);
+    p2m_teardown(p2m_get_hostp2m(d), true, NULL);
     /* Free any memory that the p2m teardown released */
     paging_lock(d);
     hap_set_allocation(d, 0, NULL);
@@ -594,14 +594,24 @@ void hap_teardown(struct domain *d, bool
         FREE_XENHEAP_PAGE(d->arch.altp2m_eptp);
 
         for ( i = 0; i < MAX_ALTP2M; i++ )
-            p2m_teardown(d->arch.altp2m_p2m[i], false);
+        {
+            p2m_teardown(d->arch.altp2m_p2m[i], false, preempted);
+            if ( preempted && *preempted )
+                return;
+        }
     }
 
     /* Destroy nestedp2m's after altp2m. */
     for ( i = 0; i < MAX_NESTEDP2M; i++ )
-        p2m_teardown(d->arch.nested_p2m[i], false);
+    {
+        p2m_teardown(d->arch.nested_p2m[i], false, preempted);
+        if ( preempted && *preempted )
+            return;
+    }
 
-    p2m_teardown(p2m_get_hostp2m(d), false);
+    p2m_teardown(p2m_get_hostp2m(d), false, preempted);
+    if ( preempted && *preempted )
+        return;
 
     paging_lock(d);
 
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -737,12 +737,13 @@ int p2m_alloc_table(struct p2m_domain *p
  * hvm fixme: when adding support for pvh non-hardware domains, this path must
  * cleanup any foreign p2m types (release refcnts on them).
  */
-void p2m_teardown(struct p2m_domain *p2m, bool remove_root)
+void p2m_teardown(struct p2m_domain *p2m, bool remove_root, bool *preempted)
 /* Return all the p2m pages to Xen.
  * We know we don't have any extra mappings to these pages */
 {
     struct page_info *pg, *root_pg = NULL;
     struct domain *d;
+    unsigned int i = 0;
 
     if (p2m == NULL)
         return;
@@ -761,8 +762,19 @@ void p2m_teardown(struct p2m_domain *p2m
     }
 
     while ( (pg = page_list_remove_head(&p2m->pages)) )
-        if ( pg != root_pg )
-            d->arch.paging.free_page(d, pg);
+    {
+        if ( pg == root_pg )
+            continue;
+
+        d->arch.paging.free_page(d, pg);
+
+        /* Arbitrarily check preemption every 1024 iterations */
+        if ( preempted && !(++i % 1024) && general_preempt_check() )
+        {
+            *preempted = true;
+            break;
+        }
+    }
 
     if ( root_pg )
         page_list_add(root_pg, &p2m->pages);
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -2747,8 +2747,12 @@ int shadow_enable(struct domain *d, u32
  out_locked:
     paging_unlock(d);
  out_unlocked:
+    /*
+     * This is fine to ignore the preemption here because only the root
+     * will be allocated by p2m_alloc_table().
+     */
     if ( rv != 0 && !pagetable_is_null(p2m_get_pagetable(p2m)) )
-        p2m_teardown(p2m, true);
+        p2m_teardown(p2m, true, NULL);
     if ( rv != 0 && pg != NULL )
     {
         pg->count_info &= ~PGC_count_mask;
@@ -2793,7 +2797,9 @@ void shadow_teardown(struct domain *d, b
 
     paging_unlock(d);
 
-    p2m_teardown(p2m_get_hostp2m(d), false);
+    p2m_teardown(p2m_get_hostp2m(d), false, preempted);
+    if ( preempted && *preempted )
+        return;
 
     paging_lock(d);
 
@@ -2912,7 +2918,7 @@ void shadow_final_teardown(struct domain
         shadow_teardown(d, NULL);
 
     /* It is now safe to pull down the p2m map. */
-    p2m_teardown(p2m_get_hostp2m(d), true);
+    p2m_teardown(p2m_get_hostp2m(d), true, NULL);
     /* Free any shadow memory that the p2m teardown released */
     paging_lock(d);
     shadow_set_allocation(d, 0, NULL);
From: Julien Grall <jgrall@amazon.com>
Subject: xen/arm: p2m: Prevent adding mapping when domain is dying

During the domain destroy process, the domain will still be accessible
until it is fully destroyed. The same is true for its P2M, because we
don't bail out early if is_dying is non-zero. If a domain has
permission to modify another domain's P2M (i.e. dom0, or a
stubdomain), then foreign mappings can be added past
relinquish_p2m_mapping().

Therefore we need to prevent mappings from being added while the
domain is dying. This commit does so by adding a d->is_dying check to
p2m_set_entry(). It also strengthens the check in
relinquish_p2m_mapping() to make sure that no mappings can be added
to the P2M after the P2M lock is released.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Tested-by: Henry Wang <Henry.Wang@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -1085,6 +1085,15 @@ int p2m_set_entry(struct p2m_domain *p2m
 {
     int rc = 0;
 
+    /*
+     * Any reference taken by the P2M mappings (e.g. foreign mapping) will
+     * be dropped in relinquish_p2m_mapping(). As the P2M will still
+     * be accessible after, we need to prevent mapping to be added when the
+     * domain is dying.
+     */
+    if ( unlikely(p2m->domain->is_dying) )
+        return -ENOMEM;
+
     while ( nr )
     {
         unsigned long mask;
@@ -1579,6 +1588,8 @@ int relinquish_p2m_mapping(struct domain
     unsigned int order;
     gfn_t start, end;
 
+    BUG_ON(!d->is_dying);
+    /* No mappings can be added in the P2M after the P2M lock is released. */
     p2m_write_lock(p2m);
 
     start = p2m->lowest_mapped_gfn;
From: Julien Grall <jgrall@amazon.com>
Subject: xen/arm: p2m: Handle preemption when freeing intermediate page tables

At the moment the P2M page tables are freed, without any preemption,
when the domain structure is freed. As the P2M can be quite large,
iterating through it may take more time than is reasonable without
intermediate preemption (to run softirqs and perhaps the scheduler).

Split p2m_teardown() in two parts: one preemptible and called when
relinquishing the resources, the other non-preemptible and called
when freeing the domain structure.

As we are now freeing the P2M pages early, we also need to prevent
further allocation if someone calls p2m_set_entry() past
p2m_teardown() (I wasn't able to prove this will never happen). This
is done by the domain->is_dying check added to p2m_set_entry() by the
previous patch.

Similarly, we want to make sure that no-one can access the freed
pages. Therefore the root is cleared before the pages are freed.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Tested-by: Henry Wang <Henry.Wang@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
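
As a rough, standalone illustration of the two-phase teardown described
above (this is not Xen code; all identifiers below are stand-ins invented
for the sketch):

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page { struct page *next; };

struct p2m_stub {                       /* stand-in for struct p2m_domain */
    struct page *pages;                 /* intermediate page tables */
    struct page *root;                  /* root page table */
};

static void clear_page_stub(struct page *pg) { (void)pg; } /* stand-in */
static void free_page_stub(struct page *pg) { (void)pg; }  /* stand-in */
static bool preempt_check_stub(void) { return false; }     /* stand-in */

/* Preemptible part, called while relinquishing resources. */
static int p2m_teardown_sketch(struct p2m_stub *p2m)
{
    unsigned long count = 0;
    struct page *pg;

    /* Clear the root first so no walk can reach soon-to-be-freed tables. */
    clear_page_stub(p2m->root);

    while ( (pg = p2m->pages) != NULL )
    {
        p2m->pages = pg->next;
        free_page_stub(pg);
        /* Arbitrarily offer to preempt every 512 iterations. */
        if ( !(++count % 512) && preempt_check_stub() )
            return -1;                  /* -ERESTART in Xen: call again later */
    }

    return 0;
}

/* Non-preemptible part, called when the domain structure is freed. */
static void p2m_final_teardown_sketch(struct p2m_stub *p2m)
{
    assert(p2m->pages == NULL);         /* bulk freeing already happened */
    free_page_stub(p2m->root);
    p2m->root = NULL;
}

The new PROG_p2m step in domain_relinquish_resources() below keeps calling
the preemptible part until it stops asking to be restarted.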

--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -774,10 +774,10 @@ fail:
 void arch_domain_destroy(struct domain *d)
 {
     /* IOMMU page table is shared with P2M, always call
-     * iommu_domain_destroy() before p2m_teardown().
+     * iommu_domain_destroy() before p2m_final_teardown().
      */
     iommu_domain_destroy(d);
-    p2m_teardown(d);
+    p2m_final_teardown(d);
     domain_vgic_free(d);
     domain_vuart_free(d);
     free_xenheap_page(d->shared_info);
@@ -979,6 +979,7 @@ enum {
     PROG_xen,
     PROG_page,
     PROG_mapping,
+    PROG_p2m,
     PROG_done,
 };
 
@@ -1029,6 +1030,11 @@ int domain_relinquish_resources(struct d
         if ( ret )
             return ret;
 
+    PROGRESS(p2m):
+        ret = p2m_teardown(d);
+        if ( ret )
+            return ret;
+
     PROGRESS(done):
         break;
 
--- a/xen/include/asm-arm/p2m.h
+++ b/xen/include/asm-arm/p2m.h
@@ -183,8 +183,17 @@ void setup_virt_paging(void);
 /* Init the datastructures for later use by the p2m code */
 int p2m_init(struct domain *d);
 
-/* Return all the p2m resources to Xen. */
-void p2m_teardown(struct domain *d);
+/*
+ * The P2M resources are freed in two parts:
+ *  - p2m_teardown() will be called when relinquish the resources. It
+ *    will free large resources (e.g. intermediate page-tables) that
+ *    requires preemption.
+ *  - p2m_final_teardown() will be called when domain struct is been
+ *    freed. This *cannot* be preempted and therefore one small
+ *    resources should be freed here.
+ */
+int p2m_teardown(struct domain *d);
+void p2m_final_teardown(struct domain *d);
 
 /*
  * Remove mapping refcount on each mapping page in the p2m
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -1496,17 +1496,58 @@ static void p2m_free_vmid(struct domain
     spin_unlock(&vmid_alloc_lock);
 }
 
-void p2m_teardown(struct domain *d)
+int p2m_teardown(struct domain *d)
 {
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
+    unsigned long count = 0;
     struct page_info *pg;
+    unsigned int i;
+    int rc = 0;
+
+    p2m_write_lock(p2m);
+
+    /*
+     * We are about to free the intermediate page-tables, so clear the
+     * root to prevent any walk to use them.
+     */
+    for ( i = 0; i < P2M_ROOT_PAGES; i++ )
+        clear_and_clean_page(p2m->root + i);
+
+    /*
+     * The domain will not be scheduled anymore, so in theory we should
+     * not need to flush the TLBs. Do it for safety purpose.
+     *
+     * Note that all the devices have already been de-assigned. So we don't
+     * need to flush the IOMMU TLB here.
+     */
+    p2m_force_tlb_flush_sync(p2m);
+
+    while ( (pg = page_list_remove_head(&p2m->pages)) )
+    {
+        free_domheap_page(pg);
+        count++;
+        /* Arbitrarily preempt every 512 iterations */
+        if ( !(count % 512) && hypercall_preempt_check() )
+        {
+            rc = -ERESTART;
+            break;
+        }
+    }
+
+    p2m_write_unlock(p2m);
+
+    return rc;
+}
+
+void p2m_final_teardown(struct domain *d)
+{
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
 
     /* p2m not actually initialized */
     if ( !p2m->domain )
         return;
 
-    while ( (pg = page_list_remove_head(&p2m->pages)) )
-        free_domheap_page(pg);
+    ASSERT(page_list_empty(&p2m->pages));
 
     if ( p2m->root )
         free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
From: Roger Pau Monné <roger.pau@citrix.com>
Subject: x86/p2m: add option to skip root pagetable removal in p2m_teardown()

Add a new parameter to p2m_teardown() in order to select whether the
root page table should also be freed.  Note that all users are
adjusted to pass the parameter to remove the root page tables, so
behavior is not modified.

No functional change intended.

This is part of CVE-2022-33746 / XSA-410.

Suggested-by: Julien Grall <julien@xen.org>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -595,7 +595,7 @@ int p2m_init(struct domain *d);
 int p2m_alloc_table(struct p2m_domain *p2m);
 
 /* Return all the p2m resources to Xen. */
-void p2m_teardown(struct p2m_domain *p2m);
+void p2m_teardown(struct p2m_domain *p2m, bool remove_root);
 void p2m_final_teardown(struct domain *d);
 
 /* Add a page to a domain's p2m table */
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -540,18 +540,18 @@ void hap_final_teardown(struct domain *d
         }
 
         for ( i = 0; i < MAX_ALTP2M; i++ )
-            p2m_teardown(d->arch.altp2m_p2m[i]);
+            p2m_teardown(d->arch.altp2m_p2m[i], true);
     }
 
     /* Destroy nestedp2m's first */
     for (i = 0; i < MAX_NESTEDP2M; i++) {
-        p2m_teardown(d->arch.nested_p2m[i]);
+        p2m_teardown(d->arch.nested_p2m[i], true);
     }
 
     if ( d->arch.paging.hap.total_pages != 0 )
         hap_teardown(d, NULL);
 
-    p2m_teardown(p2m_get_hostp2m(d));
+    p2m_teardown(p2m_get_hostp2m(d), true);
     /* Free any memory that the p2m teardown released */
     paging_lock(d);
     hap_set_allocation(d, 0, NULL);
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -737,11 +737,11 @@ int p2m_alloc_table(struct p2m_domain *p
  * hvm fixme: when adding support for pvh non-hardware domains, this path must
  * cleanup any foreign p2m types (release refcnts on them).
  */
-void p2m_teardown(struct p2m_domain *p2m)
+void p2m_teardown(struct p2m_domain *p2m, bool remove_root)
 /* Return all the p2m pages to Xen.
  * We know we don't have any extra mappings to these pages */
 {
-    struct page_info *pg;
+    struct page_info *pg, *root_pg = NULL;
     struct domain *d;
 
     if (p2m == NULL)
@@ -751,10 +751,22 @@ void p2m_teardown(struct p2m_domain *p2m
 
     p2m_lock(p2m);
     ASSERT(atomic_read(&d->shr_pages) == 0);
-    p2m->phys_table = pagetable_null();
+
+    if ( remove_root )
+        p2m->phys_table = pagetable_null();
+    else if ( !pagetable_is_null(p2m->phys_table) )
+    {
+        root_pg = pagetable_get_page(p2m->phys_table);
+        clear_domain_page(pagetable_get_mfn(p2m->phys_table));
+    }
 
     while ( (pg = page_list_remove_head(&p2m->pages)) )
-        d->arch.paging.free_page(d, pg);
+        if ( pg != root_pg )
+            d->arch.paging.free_page(d, pg);
+
+    if ( root_pg )
+        page_list_add(root_pg, &p2m->pages);
+
     p2m_unlock(p2m);
 }
 
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -2686,7 +2686,7 @@ int shadow_enable(struct domain *d, u32
     paging_unlock(d);
  out_unlocked:
     if ( rv != 0 && !pagetable_is_null(p2m_get_pagetable(p2m)) )
-        p2m_teardown(p2m);
+        p2m_teardown(p2m, true);
     if ( rv != 0 && pg != NULL )
     {
         pg->count_info &= ~PGC_count_mask;
@@ -2839,7 +2839,7 @@ void shadow_final_teardown(struct domain
         shadow_teardown(d, NULL);
 
     /* It is now safe to pull down the p2m map. */
-    p2m_teardown(p2m_get_hostp2m(d));
+    p2m_teardown(p2m_get_hostp2m(d), true);
     /* Free any shadow memory that the p2m teardown released */
     paging_lock(d);
     shadow_set_allocation(d, 0, NULL);
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/HAP: adjust monitor table related error handling

hap_make_monitor_table() will return INVALID_MFN if it encounters an
error condition, but hap_update_paging_modes() wasn’t handling this
value, resulting in an inappropriate value being stored in
monitor_table. This would subsequently misguide at least
hap_vcpu_teardown(). Avoid this by bailing early.

Further, when a domain has already been crashed or (perhaps less
important, as there's no such path known to lead here) is already
dying, avoid calling domain_crash() on it again - that's at best
confusing.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -39,6 +39,7 @@
 #include <asm/domain.h>
 #include <xen/numa.h>
 #include <asm/hvm/nestedhvm.h>
+#include <public/sched.h>
 
 #include "private.h"
 
@@ -404,8 +405,13 @@ static mfn_t hap_make_monitor_table(stru
     return m4mfn;
 
  oom:
-    printk(XENLOG_G_ERR "out of memory building monitor pagetable\n");
-    domain_crash(d);
+    if ( !d->is_dying &&
+         (!d->is_shutting_down || d->shutdown_code != SHUTDOWN_crash) )
+    {
+        printk(XENLOG_G_ERR "%pd: out of memory building monitor pagetable\n",
+               d);
+        domain_crash(d);
+    }
     return INVALID_MFN;
 }
 
@@ -758,6 +764,9 @@ static void hap_update_paging_modes(stru
     if ( pagetable_is_null(v->arch.hvm.monitor_table) )
     {
         mfn_t mmfn = hap_make_monitor_table(v);
+
+        if ( mfn_eq(mmfn, INVALID_MFN) )
+            goto unlock;
         v->arch.hvm.monitor_table = pagetable_from_mfn(mmfn);
         make_cr3(v, mmfn);
         hvm_update_host_cr3(v);
@@ -766,6 +775,7 @@ static void hap_update_paging_modes(stru
     /* CR3 is effectively updated by a mode change. Flush ASIDs, etc. */
     hap_update_cr3(v, 0, false);
 
+ unlock:
     paging_unlock(d);
     put_gfn(d, cr3_gfn);
 }
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/shadow: tolerate failure of sh_set_toplevel_shadow()

Subsequently sh_set_toplevel_shadow() will be adjusted to install a
blank entry in case prealloc fails. There are, in fact, pre-existing
error paths which would put in place a blank entry. The 4- and 2-level
code in sh_update_cr3(), however, assumes the top level entry to be
valid.

Hence bail from the function in the unlikely event that it's not. Note
that 3-level logic works differently: In particular a guest is free to
supply a PDPTR pointing at 4 non-present (or otherwise deemed invalid)
entries. The guest will crash, but we already cope with that.

Really mfn_valid() is likely wrong to use in sh_set_toplevel_shadow(),
and it should instead be !mfn_eq(gmfn, INVALID_MFN). Avoid such a change
in security context, but add a respective assertion.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -3854,6 +3854,7 @@ sh_set_toplevel_shadow(struct vcpu *v,
     /* Now figure out the new contents: is this a valid guest MFN? */
     if ( !mfn_valid(gmfn) )
     {
+        ASSERT(mfn_eq(gmfn, INVALID_MFN));
         new_entry = pagetable_null();
         goto install_new_entry;
     }
@@ -4007,6 +4008,11 @@ sh_update_cr3(struct vcpu *v, int do_loc
     if ( sh_remove_write_access(d, gmfn, 2, 0) != 0 )
         guest_flush_tlb_mask(d, d->dirty_cpumask);
     sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l2_shadow);
+    if ( unlikely(pagetable_is_null(v->arch.shadow_table[0])) )
+    {
+        ASSERT(d->is_dying || d->is_shutting_down);
+        return;
+    }
 #elif GUEST_PAGING_LEVELS == 3
     /* PAE guests have four shadow_table entries, based on the
      * current values of the guest's four l3es. */
@@ -4052,6 +4058,11 @@ sh_update_cr3(struct vcpu *v, int do_loc
     if ( sh_remove_write_access(d, gmfn, 4, 0) != 0 )
         guest_flush_tlb_mask(d, d->dirty_cpumask);
     sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l4_shadow);
+    if ( unlikely(pagetable_is_null(v->arch.shadow_table[0])) )
+    {
+        ASSERT(d->is_dying || d->is_shutting_down);
+        return;
+    }
     if ( !shadow_mode_external(d) && !is_pv_32bit_domain(d) )
     {
         mfn_t smfn = pagetable_get_mfn(v->arch.shadow_table[0]);
From: Roger Pau Monné <roger.pau@citrix.com>
Subject: x86/shadow: tolerate failure in shadow_prealloc()

Prevent _shadow_prealloc() from calling BUG() when unable to fulfill
the pre-allocation and instead return true/false.  Modify
shadow_prealloc() to crash the domain on allocation failure (if the
domain is not already dying), as shadow cannot operate normally after
that.  Modify callers to also gracefully handle {_,}shadow_prealloc()
failing to fulfill the request.

Note this in turn requires adjusting the callers of
sh_make_monitor_table() also to handle it returning INVALID_MFN.
sh_update_paging_modes() is also modified to add additional error
paths in case of allocation failure, some of which will return with
null monitor page tables (and the domain likely crashed).  This is no
different from the current error paths, but the newly introduced ones
are more likely to trigger.

The newly added failure points in sh_update_paging_modes() also
require that on some error return paths the previous structures are
cleared, and thus the monitor table is null.

While there, adjust the 'type' parameter type of shadow_prealloc() to
unsigned int rather than u32.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -36,6 +36,7 @@
 #include <asm/shadow.h>
 #include <asm/hvm/ioreq.h>
 #include <xen/numa.h>
+#include <public/sched.h>
 #include "private.h"
 
 DEFINE_PER_CPU(uint32_t,trace_shadow_path_flags);
@@ -927,14 +928,15 @@ static inline void trace_shadow_prealloc
 
 /* Make sure there are at least count order-sized pages
  * available in the shadow page pool. */
-static void _shadow_prealloc(struct domain *d, unsigned int pages)
+static bool __must_check _shadow_prealloc(struct domain *d, unsigned int pages)
 {
     struct vcpu *v;
     struct page_info *sp, *t;
     mfn_t smfn;
     int i;
 
-    if ( d->arch.paging.shadow.free_pages >= pages ) return;
+    if ( d->arch.paging.shadow.free_pages >= pages )
+        return true;
 
     /* Shouldn't have enabled shadows if we've no vcpus. */
     ASSERT(d->vcpu && d->vcpu[0]);
@@ -950,7 +952,8 @@ static void _shadow_prealloc(struct doma
         sh_unpin(d, smfn);
 
         /* See if that freed up enough space */
-        if ( d->arch.paging.shadow.free_pages >= pages ) return;
+        if ( d->arch.paging.shadow.free_pages >= pages )
+            return true;
     }
 
     /* Stage two: all shadow pages are in use in hierarchies that are
@@ -971,7 +974,7 @@ static void _shadow_prealloc(struct doma
                 if ( d->arch.paging.shadow.free_pages >= pages )
                 {
                     guest_flush_tlb_mask(d, d->dirty_cpumask);
-                    return;
+                    return true;
                 }
             }
         }
@@ -984,7 +987,12 @@ static void _shadow_prealloc(struct doma
            d->arch.paging.shadow.total_pages,
            d->arch.paging.shadow.free_pages,
            d->arch.paging.shadow.p2m_pages);
-    BUG();
+
+    ASSERT(d->is_dying);
+
+    guest_flush_tlb_mask(d, d->dirty_cpumask);
+
+    return false;
 }
 
 /* Make sure there are at least count pages of the order according to
@@ -992,9 +1000,19 @@ static void _shadow_prealloc(struct doma
  * This must be called before any calls to shadow_alloc().  Since this
  * will free existing shadows to make room, it must be called early enough
  * to avoid freeing shadows that the caller is currently working on. */
-void shadow_prealloc(struct domain *d, u32 type, unsigned int count)
+bool shadow_prealloc(struct domain *d, unsigned int type, unsigned int count)
 {
-    return _shadow_prealloc(d, shadow_size(type) * count);
+    bool ret = _shadow_prealloc(d, shadow_size(type) * count);
+
+    if ( !ret && !d->is_dying &&
+         (!d->is_shutting_down || d->shutdown_code != SHUTDOWN_crash) )
+        /*
+         * Failing to allocate memory required for shadow usage can only result in
+         * a domain crash, do it here rather that relying on every caller to do it.
+         */
+        domain_crash(d);
+
+    return ret;
 }
 
 /* Deliberately free all the memory we can: this will tear down all of
@@ -1211,7 +1229,7 @@ void shadow_free(struct domain *d, mfn_t
 static struct page_info *
 shadow_alloc_p2m_page(struct domain *d)
 {
-    struct page_info *pg;
+    struct page_info *pg = NULL;
 
     /* This is called both from the p2m code (which never holds the
      * paging lock) and the log-dirty code (which always does). */
@@ -1229,16 +1247,18 @@ shadow_alloc_p2m_page(struct domain *d)
                     d->arch.paging.shadow.p2m_pages,
                     shadow_min_acceptable_pages(d));
         }
-        paging_unlock(d);
-        return NULL;
+        goto out;
     }
 
-    shadow_prealloc(d, SH_type_p2m_table, 1);
+    if ( !shadow_prealloc(d, SH_type_p2m_table, 1) )
+        goto out;
+
     pg = mfn_to_page(shadow_alloc(d, SH_type_p2m_table, 0));
     d->arch.paging.shadow.p2m_pages++;
     d->arch.paging.shadow.total_pages--;
     ASSERT(!page_get_owner(pg) && !(pg->count_info & PGC_count_mask));
 
+ out:
     paging_unlock(d);
 
     return pg;
@@ -1329,7 +1349,9 @@ int shadow_set_allocation(struct domain
         else if ( d->arch.paging.shadow.total_pages > pages )
         {
             /* Need to return memory to domheap */
-            _shadow_prealloc(d, 1);
+            if ( !_shadow_prealloc(d, 1) )
+                return -ENOMEM;
+
             sp = page_list_remove_head(&d->arch.paging.shadow.freelist);
             ASSERT(sp);
             /*
@@ -2397,12 +2419,13 @@ static void sh_update_paging_modes(struc
     if ( mfn_eq(v->arch.paging.shadow.oos_snapshot[0], INVALID_MFN) )
     {
         int i;
+
+        if ( !shadow_prealloc(d, SH_type_oos_snapshot, SHADOW_OOS_PAGES) )
+            return;
+
         for(i = 0; i < SHADOW_OOS_PAGES; i++)
-        {
-            shadow_prealloc(d, SH_type_oos_snapshot, 1);
             v->arch.paging.shadow.oos_snapshot[i] =
                 shadow_alloc(d, SH_type_oos_snapshot, 0);
-        }
     }
 #endif /* OOS */
 
@@ -2464,6 +2487,10 @@ static void sh_update_paging_modes(struc
         if ( pagetable_is_null(v->arch.hvm.monitor_table) )
         {
             mfn_t mmfn = v->arch.paging.mode->shadow.make_monitor_table(v);
+
+            if ( mfn_eq(mmfn, INVALID_MFN) )
+                return;
+
             v->arch.hvm.monitor_table = pagetable_from_mfn(mmfn);
             make_cr3(v, mmfn);
             hvm_update_host_cr3(v);
@@ -2501,6 +2528,11 @@ static void sh_update_paging_modes(struc
                 old_mfn = pagetable_get_mfn(v->arch.hvm.monitor_table);
                 v->arch.hvm.monitor_table = pagetable_null();
                 new_mfn = v->arch.paging.mode->shadow.make_monitor_table(v);
+                if ( mfn_eq(new_mfn, INVALID_MFN) )
+                {
+                    old_mode->shadow.destroy_monitor_table(v, old_mfn);
+                    return;
+                }
                 v->arch.hvm.monitor_table = pagetable_from_mfn(new_mfn);
                 SHADOW_PRINTK("new monitor table %"PRI_mfn "\n",
                                mfn_x(new_mfn));
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -1535,7 +1535,8 @@ sh_make_monitor_table(struct vcpu *v)
     ASSERT(pagetable_get_pfn(v->arch.hvm.monitor_table) == 0);
 
     /* Guarantee we can get the memory we need */
-    shadow_prealloc(d, SH_type_monitor_table, CONFIG_PAGING_LEVELS);
+    if ( !shadow_prealloc(d, SH_type_monitor_table, CONFIG_PAGING_LEVELS) )
+        return INVALID_MFN;
 
     {
         mfn_t m4mfn;
@@ -3067,9 +3068,14 @@ static int sh_page_fault(struct vcpu *v,
      * Preallocate shadow pages *before* removing writable accesses
      * otherwhise an OOS L1 might be demoted and promoted again with
      * writable mappings. */
-    shadow_prealloc(d,
-                    SH_type_l1_shadow,
-                    GUEST_PAGING_LEVELS < 4 ? 1 : GUEST_PAGING_LEVELS - 1);
+    if ( !shadow_prealloc(d, SH_type_l1_shadow,
+                          GUEST_PAGING_LEVELS < 4
+                          ? 1 : GUEST_PAGING_LEVELS - 1) )
+    {
+        paging_unlock(d);
+        put_gfn(d, gfn_x(gfn));
+        return 0;
+    }
 
     rc = gw_remove_write_accesses(v, va, &gw);
 
@@ -3864,7 +3870,12 @@ sh_set_toplevel_shadow(struct vcpu *v,
     if ( !mfn_valid(smfn) )
     {
         /* Make sure there's enough free shadow memory. */
-        shadow_prealloc(d, root_type, 1);
+        if ( !shadow_prealloc(d, root_type, 1) )
+        {
+            new_entry = pagetable_null();
+            goto install_new_entry;
+        }
+
         /* Shadow the page. */
         smfn = sh_make_shadow(v, gmfn, root_type);
     }
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -351,7 +351,8 @@ void shadow_promote(struct domain *d, mf
 void shadow_demote(struct domain *d, mfn_t gmfn, u32 type);
 
 /* Shadow page allocation functions */
-void  shadow_prealloc(struct domain *d, u32 shadow_type, unsigned int count);
+bool __must_check shadow_prealloc(struct domain *d, unsigned int shadow_type,
+                                  unsigned int count);
 mfn_t shadow_alloc(struct domain *d,
                     u32 shadow_type,
                     unsigned long backpointer);
From: Roger Pau Monné <roger.pau@citrix.com>
Subject: x86/p2m: refuse new allocations for dying domains

This will in particular prevent any attempts to add entries to the p2m,
once - in a subsequent change - non-root entries have been removed.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -244,6 +244,9 @@ static struct page_info *hap_alloc(struc
 
     ASSERT(paging_locked_by_me(d));
 
+    if ( unlikely(d->is_dying) )
+        return NULL;
+
     pg = page_list_remove_head(&d->arch.paging.hap.freelist);
     if ( unlikely(!pg) )
         return NULL;
@@ -280,7 +283,7 @@ static struct page_info *hap_alloc_p2m_p
         d->arch.paging.hap.p2m_pages++;
         ASSERT(!page_get_owner(pg) && !(pg->count_info & PGC_count_mask));
     }
-    else if ( !d->arch.paging.p2m_alloc_failed )
+    else if ( !d->arch.paging.p2m_alloc_failed && !d->is_dying )
     {
         d->arch.paging.p2m_alloc_failed = 1;
         dprintk(XENLOG_ERR, "d%i failed to allocate from HAP pool\n",
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -938,6 +938,10 @@ static bool __must_check _shadow_preallo
     if ( d->arch.paging.shadow.free_pages >= pages )
         return true;
 
+    if ( unlikely(d->is_dying) )
+        /* No reclaim when the domain is dying, teardown will take care of it. */
+        return false;
+
     /* Shouldn't have enabled shadows if we've no vcpus. */
     ASSERT(d->vcpu && d->vcpu[0]);
 
@@ -988,7 +992,7 @@ static bool __must_check _shadow_preallo
            d->arch.paging.shadow.free_pages,
            d->arch.paging.shadow.p2m_pages);
 
-    ASSERT(d->is_dying);
+    ASSERT_UNREACHABLE();
 
     guest_flush_tlb_mask(d, d->dirty_cpumask);
 
@@ -1002,10 +1006,13 @@ static bool __must_check _shadow_preallo
  * to avoid freeing shadows that the caller is currently working on. */
 bool shadow_prealloc(struct domain *d, unsigned int type, unsigned int count)
 {
-    bool ret = _shadow_prealloc(d, shadow_size(type) * count);
+    bool ret;
+
+    if ( unlikely(d->is_dying) )
+       return false;
 
-    if ( !ret && !d->is_dying &&
-         (!d->is_shutting_down || d->shutdown_code != SHUTDOWN_crash) )
+    ret = _shadow_prealloc(d, shadow_size(type) * count);
+    if ( !ret && (!d->is_shutting_down || d->shutdown_code != SHUTDOWN_crash) )
         /*
          * Failing to allocate memory required for shadow usage can only result in
          * a domain crash, do it here rather that relying on every caller to do it.
@@ -1231,6 +1238,9 @@ shadow_alloc_p2m_page(struct domain *d)
 {
     struct page_info *pg = NULL;
 
+    if ( unlikely(d->is_dying) )
+       return NULL;
+
     /* This is called both from the p2m code (which never holds the
      * paging lock) and the log-dirty code (which always does). */
     paging_lock_recursive(d);
From: Roger Pau Monné <roger.pau@citrix.com>
Subject: x86/p2m: truly free paging pool memory for dying domains

Modify {hap,shadow}_free to free the page immediately if the domain is
dying, so that pages don't accumulate in the pool when
{shadow,hap}_final_teardown() get called. This is to limit the amount of
work which needs to be done there (in a non-preemptable manner).

Note the call to shadow_free() in shadow_free_p2m_page() is moved after
increasing total_pages, so that the decrease done in shadow_free() in
case the domain is dying doesn't underflow the counter, even if just for
a short interval.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
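
As a rough, standalone illustration (this is not Xen code; the identifiers
below are stand-ins invented for the sketch), a pool free routine that
short-circuits for dying domains could look like this:

#include <stdbool.h>
#include <stddef.h>

struct page { struct page *next; };

struct pool_stub {                      /* stand-in for the paging pool */
    struct page *freelist;
    unsigned long total_pages, free_pages;
    bool owner_is_dying;
};

static void free_domheap_page_stub(struct page *pg) { (void)pg; } /* stand-in */

static void pool_free_sketch(struct pool_stub *pool, struct page *pg)
{
    /*
     * For a dying domain, hand the page straight back to the allocator
     * instead of parking it on the pool's free list, so that the final
     * (non-preemptible) teardown has almost nothing left to free.
     */
    if ( pool->owner_is_dying )
    {
        free_domheap_page_stub(pg);
        pool->total_pages--;
        return;
    }

    pg->next = pool->freelist;
    pool->freelist = pg;
    pool->free_pages++;
}

This mirrors why the shadow_free() call in shadow_free_p2m_page() has to be
moved after the total_pages increment: with the dying-domain path decreasing
total_pages, the old ordering could momentarily underflow the counter.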

--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -264,6 +264,18 @@ static void hap_free(struct domain *d, m
 
     ASSERT(paging_locked_by_me(d));
 
+    /*
+     * For dying domains, actually free the memory here. This way less work is
+     * left to hap_final_teardown(), which cannot easily have preemption checks
+     * added.
+     */
+    if ( unlikely(d->is_dying) )
+    {
+        free_domheap_page(pg);
+        d->arch.paging.hap.total_pages--;
+        return;
+    }
+
     d->arch.paging.hap.free_pages++;
     page_list_add_tail(pg, &d->arch.paging.hap.freelist);
 }
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1180,6 +1180,7 @@ mfn_t shadow_alloc(struct domain *d,
 void shadow_free(struct domain *d, mfn_t smfn)
 {
     struct page_info *next = NULL, *sp = mfn_to_page(smfn);
+    bool dying = ACCESS_ONCE(d->is_dying);
     struct page_list_head *pin_list;
     unsigned int pages;
     u32 shadow_type;
@@ -1222,11 +1223,32 @@ void shadow_free(struct domain *d, mfn_t
          * just before the allocator hands the page out again. */
         page_set_tlbflush_timestamp(sp);
         perfc_decr(shadow_alloc_count);
-        page_list_add_tail(sp, &d->arch.paging.shadow.freelist);
+
+        /*
+         * For dying domains, actually free the memory here. This way less
+         * work is left to shadow_final_teardown(), which cannot easily have
+         * preemption checks added.
+         */
+        if ( unlikely(dying) )
+        {
+            /*
+             * The backpointer field (sh.back) used by shadow code aliases the
+             * domain owner field, unconditionally clear it here to avoid
+             * free_domheap_page() attempting to parse it.
+             */
+            page_set_owner(sp, NULL);
+            free_domheap_page(sp);
+        }
+        else
+            page_list_add_tail(sp, &d->arch.paging.shadow.freelist);
+
         sp = next;
     }
 
-    d->arch.paging.shadow.free_pages += pages;
+    if ( unlikely(dying) )
+        d->arch.paging.shadow.total_pages -= pages;
+    else
+        d->arch.paging.shadow.free_pages += pages;
 }
 
 /* Divert a page from the pool to be used by the p2m mapping.
@@ -1296,9 +1318,9 @@ shadow_free_p2m_page(struct domain *d, s
      * paging lock) and the log-dirty code (which always does). */
     paging_lock_recursive(d);
 
-    shadow_free(d, page_to_mfn(pg));
     d->arch.paging.shadow.p2m_pages--;
     d->arch.paging.shadow.total_pages++;
+    shadow_free(d, page_to_mfn(pg));
 
     paging_unlock(d);
 }
From: Roger Pau Monné <roger.pau@citrix.com>
Subject: x86/p2m: free the paging memory pool preemptively

The paging memory pool is currently freed in two different places:
from {shadow,hap}_teardown() via domain_relinquish_resources() and
from {shadow,hap}_final_teardown() via complete_domain_destroy().
While the former does handle preemption, the latter doesn't.

Attempt to move as much p2m related freeing as possible to happen
before the call to {shadow,hap}_teardown(), so that most memory can be
freed in a preemptive way.  In order to avoid causing issues to
existing callers, leave the root p2m page tables set and free them in
{hap,shadow}_final_teardown().  Also modify {hap,shadow}_free to free
the page immediately if the domain is dying, so that pages don't
accumulate in the pool when {shadow,hap}_final_teardown() get called.

Move altp2m_vcpu_disable_ve() to be done in hap_teardown(), as that's
the place where altp2m_active gets disabled now.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -38,7 +38,6 @@
 #include <xen/livepatch.h>
 #include <public/sysctl.h>
 #include <public/hvm/hvm_vcpu.h>
-#include <asm/altp2m.h>
 #include <asm/regs.h>
 #include <asm/mc146818rtc.h>
 #include <asm/system.h>
@@ -2120,12 +2119,6 @@ int domain_relinquish_resources(struct d
             vpmu_destroy(v);
         }
 
-        if ( altp2m_active(d) )
-        {
-            for_each_vcpu ( d, v )
-                altp2m_vcpu_disable_ve(v);
-        }
-
         if ( is_pv_domain(d) )
         {
             for_each_vcpu ( d, v )
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -28,6 +28,7 @@
 #include <xen/domain_page.h>
 #include <xen/guest_access.h>
 #include <xen/keyhandler.h>
+#include <asm/altp2m.h>
 #include <asm/event.h>
 #include <asm/page.h>
 #include <asm/current.h>
@@ -545,24 +546,8 @@ void hap_final_teardown(struct domain *d
     unsigned int i;
 
     if ( hvm_altp2m_supported() )
-    {
-        d->arch.altp2m_active = 0;
-
-        if ( d->arch.altp2m_eptp )
-        {
-            free_xenheap_page(d->arch.altp2m_eptp);
-            d->arch.altp2m_eptp = NULL;
-        }
-
-        if ( d->arch.altp2m_visible_eptp )
-        {
-            free_xenheap_page(d->arch.altp2m_visible_eptp);
-            d->arch.altp2m_visible_eptp = NULL;
-        }
-
         for ( i = 0; i < MAX_ALTP2M; i++ )
             p2m_teardown(d->arch.altp2m_p2m[i], true);
-    }
 
     /* Destroy nestedp2m's first */
     for (i = 0; i < MAX_NESTEDP2M; i++) {
@@ -577,6 +562,8 @@ void hap_final_teardown(struct domain *d
     paging_lock(d);
     hap_set_allocation(d, 0, NULL);
     ASSERT(d->arch.paging.hap.p2m_pages == 0);
+    ASSERT(d->arch.paging.hap.free_pages == 0);
+    ASSERT(d->arch.paging.hap.total_pages == 0);
     paging_unlock(d);
 }
 
@@ -584,6 +571,7 @@ void hap_teardown(struct domain *d, bool
 {
     struct vcpu *v;
     mfn_t mfn;
+    unsigned int i;
 
     ASSERT(d->is_dying);
     ASSERT(d != current->domain);
@@ -605,6 +593,32 @@ void hap_teardown(struct domain *d, bool
         }
     }
 
+    paging_unlock(d);
+
+    /* Leave the root pt in case we get further attempts to modify the p2m. */
+    if ( hvm_altp2m_supported() )
+    {
+        if ( altp2m_active(d) )
+            for_each_vcpu ( d, v )
+                altp2m_vcpu_disable_ve(v);
+
+        d->arch.altp2m_active = 0;
+
+        FREE_XENHEAP_PAGE(d->arch.altp2m_eptp);
+        FREE_XENHEAP_PAGE(d->arch.altp2m_visible_eptp);
+
+        for ( i = 0; i < MAX_ALTP2M; i++ )
+            p2m_teardown(d->arch.altp2m_p2m[i], false);
+    }
+
+    /* Destroy nestedp2m's after altp2m. */
+    for ( i = 0; i < MAX_NESTEDP2M; i++ )
+        p2m_teardown(d->arch.nested_p2m[i], false);
+
+    p2m_teardown(p2m_get_hostp2m(d), false);
+
+    paging_lock(d);
+
     if ( d->arch.paging.hap.total_pages != 0 )
     {
         hap_set_allocation(d, 0, preempted);
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -2795,6 +2795,19 @@ void shadow_teardown(struct domain *d, b
         }
     }
 
+    paging_unlock(d);
+
+    p2m_teardown(p2m_get_hostp2m(d), false);
+
+    paging_lock(d);
+
+    /*
+     * Reclaim all shadow memory so that shadow_set_allocation() doesn't find
+     * in-use pages, as _shadow_prealloc() will no longer try to reclaim pages
+     * because the domain is dying.
+     */
+    shadow_blow_tables(d);
+
 #if (SHADOW_OPTIMIZATIONS & (SHOPT_VIRTUAL_TLB|SHOPT_OUT_OF_SYNC))
     /* Free the virtual-TLB array attached to each vcpu */
     for_each_vcpu(d, v)
@@ -2913,6 +2926,9 @@ void shadow_final_teardown(struct domain
                    d->arch.paging.shadow.total_pages,
                    d->arch.paging.shadow.free_pages,
                    d->arch.paging.shadow.p2m_pages);
+    ASSERT(!d->arch.paging.shadow.total_pages);
+    ASSERT(!d->arch.paging.shadow.free_pages);
+    ASSERT(!d->arch.paging.shadow.p2m_pages);
     paging_unlock(d);
 }
 
From: Julien Grall <jgrall@amazon.com>
Subject: xen/x86: p2m: Add preemption in p2m_teardown()

The list p2m->pages contains all the pages used by the P2M. On large
instances this can be quite long, and the time spent calling
d->arch.paging.free_page() exceeds 1ms for an 80GB guest on a Xen
running in a nested environment on a c5.metal.

By extrapolation, it would take > 100ms for an 8TB guest (which is
what we currently security support). So add some preemption in
p2m_teardown() and propagate it to the callers. Note there are 3
places where preemption is not enabled:
    - hap_final_teardown()/shadow_final_teardown(): Updates to the
      P2M are prevented once the domain is dying (so no more pages
      can be allocated), and most of the P2M pages will already have
      been freed in a preemptible manner when relinquishing the
      resources. So it is fine to leave preemption disabled here.
    - shadow_enable(): This is fine because it only undoes the
      allocation that may have been made by p2m_alloc_table() (i.e.
      just the root page table).

Preemption is arbitrarily checked every 1024 iterations.

Note that with the current approach Xen doesn't keep track of whether
the alt/nested P2Ms have already been cleared, so some of the work is
redundant. However, this is not expected to incur much overhead (the
P2M lock shouldn't be contended during teardown), so this
optimization is left outside the scope of the security fix.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -595,7 +595,7 @@ int p2m_init(struct domain *d);
 int p2m_alloc_table(struct p2m_domain *p2m);
 
 /* Return all the p2m resources to Xen. */
-void p2m_teardown(struct p2m_domain *p2m, bool remove_root);
+void p2m_teardown(struct p2m_domain *p2m, bool remove_root, bool *preempted);
 void p2m_final_teardown(struct domain *d);
 
 /* Add a page to a domain's p2m table */
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -547,17 +547,17 @@ void hap_final_teardown(struct domain *d
 
     if ( hvm_altp2m_supported() )
         for ( i = 0; i < MAX_ALTP2M; i++ )
-            p2m_teardown(d->arch.altp2m_p2m[i], true);
+            p2m_teardown(d->arch.altp2m_p2m[i], true, NULL);
 
     /* Destroy nestedp2m's first */
     for (i = 0; i < MAX_NESTEDP2M; i++) {
-        p2m_teardown(d->arch.nested_p2m[i], true);
+        p2m_teardown(d->arch.nested_p2m[i], true, NULL);
     }
 
     if ( d->arch.paging.hap.total_pages != 0 )
         hap_teardown(d, NULL);
 
-    p2m_teardown(p2m_get_hostp2m(d), true);
+    p2m_teardown(p2m_get_hostp2m(d), true, NULL);
     /* Free any memory that the p2m teardown released */
     paging_lock(d);
     hap_set_allocation(d, 0, NULL);
@@ -608,14 +608,24 @@ void hap_teardown(struct domain *d, bool
         FREE_XENHEAP_PAGE(d->arch.altp2m_visible_eptp);
 
         for ( i = 0; i < MAX_ALTP2M; i++ )
-            p2m_teardown(d->arch.altp2m_p2m[i], false);
+        {
+            p2m_teardown(d->arch.altp2m_p2m[i], false, preempted);
+            if ( preempted && *preempted )
+                return;
+        }
     }
 
     /* Destroy nestedp2m's after altp2m. */
     for ( i = 0; i < MAX_NESTEDP2M; i++ )
-        p2m_teardown(d->arch.nested_p2m[i], false);
+    {
+        p2m_teardown(d->arch.nested_p2m[i], false, preempted);
+        if ( preempted && *preempted )
+            return;
+    }
 
-    p2m_teardown(p2m_get_hostp2m(d), false);
+    p2m_teardown(p2m_get_hostp2m(d), false, preempted);
+    if ( preempted && *preempted )
+        return;
 
     paging_lock(d);
 
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -737,12 +737,13 @@ int p2m_alloc_table(struct p2m_domain *p
  * hvm fixme: when adding support for pvh non-hardware domains, this path must
  * cleanup any foreign p2m types (release refcnts on them).
  */
-void p2m_teardown(struct p2m_domain *p2m, bool remove_root)
+void p2m_teardown(struct p2m_domain *p2m, bool remove_root, bool *preempted)
 /* Return all the p2m pages to Xen.
  * We know we don't have any extra mappings to these pages */
 {
     struct page_info *pg, *root_pg = NULL;
     struct domain *d;
+    unsigned int i = 0;
 
     if (p2m == NULL)
         return;
@@ -761,8 +762,19 @@ void p2m_teardown(struct p2m_domain *p2m
     }
 
     while ( (pg = page_list_remove_head(&p2m->pages)) )
-        if ( pg != root_pg )
-            d->arch.paging.free_page(d, pg);
+    {
+        if ( pg == root_pg )
+            continue;
+
+        d->arch.paging.free_page(d, pg);
+
+        /* Arbitrarily check preemption every 1024 iterations */
+        if ( preempted && !(++i % 1024) && general_preempt_check() )
+        {
+            *preempted = true;
+            break;
+        }
+    }
 
     if ( root_pg )
         page_list_add(root_pg, &p2m->pages);
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -2749,8 +2749,12 @@ int shadow_enable(struct domain *d, u32
  out_locked:
     paging_unlock(d);
  out_unlocked:
+    /*
+     * This is fine to ignore the preemption here because only the root
+     * will be allocated by p2m_alloc_table().
+     */
     if ( rv != 0 && !pagetable_is_null(p2m_get_pagetable(p2m)) )
-        p2m_teardown(p2m, true);
+        p2m_teardown(p2m, true, NULL);
     if ( rv != 0 && pg != NULL )
     {
         pg->count_info &= ~PGC_count_mask;
@@ -2797,7 +2801,9 @@ void shadow_teardown(struct domain *d, b
 
     paging_unlock(d);
 
-    p2m_teardown(p2m_get_hostp2m(d), false);
+    p2m_teardown(p2m_get_hostp2m(d), false, preempted);
+    if ( preempted && *preempted )
+        return;
 
     paging_lock(d);
 
@@ -2916,7 +2922,7 @@ void shadow_final_teardown(struct domain
         shadow_teardown(d, NULL);
 
     /* It is now safe to pull down the p2m map. */
-    p2m_teardown(p2m_get_hostp2m(d), true);
+    p2m_teardown(p2m_get_hostp2m(d), true, NULL);
     /* Free any shadow memory that the p2m teardown released */
     paging_lock(d);
     shadow_set_allocation(d, 0, NULL);
From 4b4359122a414cc15156e13e3805988b71ff9da0 Mon Sep 17 00:00:00 2001
From: Julien Grall <jgrall@amazon.com>
Date: Mon, 6 Jun 2022 06:17:25 +0000
Subject: [PATCH 1/2] xen/arm: p2m: Prevent adding mapping when domain is dying

During the domain destroy process, the domain will still be accessible
until it is fully destroyed. The same is true for its P2M, because we
don't bail out early if is_dying is non-zero. If a domain has
permission to modify another domain's P2M (i.e. dom0, or a
stubdomain), then foreign mappings can be added past
relinquish_p2m_mapping().

Therefore we need to prevent mappings from being added while the
domain is dying. This commit does so by adding a d->is_dying check to
p2m_set_entry(). It also strengthens the check in
relinquish_p2m_mapping() to make sure that no mappings can be added
to the P2M after the P2M lock is released.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Tested-by: Henry Wang <Henry.Wang@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
---
 xen/arch/arm/p2m.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
index fb71fa4c1c90..cbeff90f4371 100644
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -1093,6 +1093,15 @@ int p2m_set_entry(struct p2m_domain *p2m,
 {
     int rc = 0;
 
+    /*
+     * Any reference taken by the P2M mappings (e.g. foreign mapping) will
+     * be dropped in relinquish_p2m_mapping(). As the P2M will still
+     * be accessible after, we need to prevent mapping to be added when the
+     * domain is dying.
+     */
+    if ( unlikely(p2m->domain->is_dying) )
+        return -ENOMEM;
+
     while ( nr )
     {
         unsigned long mask;
@@ -1610,6 +1619,8 @@ int relinquish_p2m_mapping(struct domain *d)
     unsigned int order;
     gfn_t start, end;
 
+    BUG_ON(!d->is_dying);
+    /* No mappings can be added in the P2M after the P2M lock is released. */
     p2m_write_lock(p2m);
 
     start = p2m->lowest_mapped_gfn;
-- 
2.37.1

From 0d5846490348fa09a0d0915d7c795685a016ce10 Mon Sep 17 00:00:00 2001
From: Julien Grall <jgrall@amazon.com>
Date: Mon, 6 Jun 2022 06:17:26 +0000
Subject: [PATCH 2/2] xen/arm: p2m: Handle preemption when freeing intermediate
 page tables

At the moment the P2M page tables are freed, without any preemption,
when the domain structure is freed. As the P2M can be quite large,
iterating through it may take more time than is reasonable without
intermediate preemption (to run softirqs and perhaps the scheduler).

Split p2m_teardown() in two parts: one preemptible and called when
relinquishing the resources, the other non-preemptible and called
when freeing the domain structure.

As we are now freeing the P2M pages early, we also need to prevent
further allocation if someone calls p2m_set_entry() past
p2m_teardown() (I wasn't able to prove this will never happen). This
is done by the domain->is_dying check added to p2m_set_entry() by the
previous patch.

Similarly, we want to make sure that no-one can access the freed
pages. Therefore the root is cleared before the pages are freed.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Tested-by: Henry Wang <Henry.Wang@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
---
 xen/arch/arm/domain.c     | 10 +++++++--
 xen/arch/arm/p2m.c        | 47 ++++++++++++++++++++++++++++++++++++---
 xen/include/asm-arm/p2m.h | 13 +++++++++--
 3 files changed, 63 insertions(+), 7 deletions(-)

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index 96e1b235501d..2694c39127c5 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -789,10 +789,10 @@ fail:
 void arch_domain_destroy(struct domain *d)
 {
     /* IOMMU page table is shared with P2M, always call
-     * iommu_domain_destroy() before p2m_teardown().
+     * iommu_domain_destroy() before p2m_final_teardown().
      */
     iommu_domain_destroy(d);
-    p2m_teardown(d);
+    p2m_final_teardown(d);
     domain_vgic_free(d);
     domain_vuart_free(d);
     free_xenheap_page(d->shared_info);
@@ -996,6 +996,7 @@ enum {
     PROG_xen,
     PROG_page,
     PROG_mapping,
+    PROG_p2m,
     PROG_done,
 };
 
@@ -1056,6 +1057,11 @@ int domain_relinquish_resources(struct domain *d)
         if ( ret )
             return ret;
 
+    PROGRESS(p2m):
+        ret = p2m_teardown(d);
+        if ( ret )
+            return ret;
+
     PROGRESS(done):
         break;
 
diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
index cbeff90f4371..3bcd1e897e88 100644
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -1527,17 +1527,58 @@ static void p2m_free_vmid(struct domain *d)
     spin_unlock(&vmid_alloc_lock);
 }
 
-void p2m_teardown(struct domain *d)
+int p2m_teardown(struct domain *d)
 {
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
+    unsigned long count = 0;
     struct page_info *pg;
+    unsigned int i;
+    int rc = 0;
+
+    p2m_write_lock(p2m);
+
+    /*
+     * We are about to free the intermediate page-tables, so clear the
+     * root to prevent any walk to use them.
+     */
+    for ( i = 0; i < P2M_ROOT_PAGES; i++ )
+        clear_and_clean_page(p2m->root + i);
+
+    /*
+     * The domain will not be scheduled anymore, so in theory we should
+     * not need to flush the TLBs. Do it for safety purpose.
+     *
+     * Note that all the devices have already been de-assigned. So we don't
+     * need to flush the IOMMU TLB here.
+     */
+    p2m_force_tlb_flush_sync(p2m);
+
+    while ( (pg = page_list_remove_head(&p2m->pages)) )
+    {
+        free_domheap_page(pg);
+        count++;
+        /* Arbitrarily preempt every 512 iterations */
+        if ( !(count % 512) && hypercall_preempt_check() )
+        {
+            rc = -ERESTART;
+            break;
+        }
+    }
+
+    p2m_write_unlock(p2m);
+
+    return rc;
+}
+
+void p2m_final_teardown(struct domain *d)
+{
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
 
     /* p2m not actually initialized */
     if ( !p2m->domain )
         return;
 
-    while ( (pg = page_list_remove_head(&p2m->pages)) )
-        free_domheap_page(pg);
+    ASSERT(page_list_empty(&p2m->pages));
 
     if ( p2m->root )
         free_domheap_pages(p2m->root, P2M_ROOT_ORDER);
diff --git a/xen/include/asm-arm/p2m.h b/xen/include/asm-arm/p2m.h
index 8f11d9c97b5d..b3ba83283e11 100644
--- a/xen/include/asm-arm/p2m.h
+++ b/xen/include/asm-arm/p2m.h
@@ -192,8 +192,17 @@ void setup_virt_paging(void);
 /* Init the datastructures for later use by the p2m code */
 int p2m_init(struct domain *d);
 
-/* Return all the p2m resources to Xen. */
-void p2m_teardown(struct domain *d);
+/*
+ * The P2M resources are freed in two parts:
+ *  - p2m_teardown() will be called when relinquish the resources. It
+ *    will free large resources (e.g. intermediate page-tables) that
+ *    requires preemption.
+ *  - p2m_final_teardown() will be called when domain struct is been
+ *    freed. This *cannot* be preempted and therefore one small
+ *    resources should be freed here.
+ */
+int p2m_teardown(struct domain *d);
+void p2m_final_teardown(struct domain *d);
 
 /*
  * Remove mapping refcount on each mapping page in the p2m
-- 
2.37.1

From: Roger Pau Monné <roger.pau@citrix.com>
Subject: x86/p2m: add option to skip root pagetable removal in p2m_teardown()

Add a new parameter to p2m_teardown() in order to select whether the
root page table should also be freed.  Note that all users are
adjusted to pass the parameter to remove the root page tables, so
behavior is not modified.

No functional change intended.

This is part of CVE-2022-33746 / XSA-410.

Suggested-by: Julien Grall <julien@xen.org>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -574,7 +574,7 @@ int p2m_init(struct domain *d);
 int p2m_alloc_table(struct p2m_domain *p2m);
 
 /* Return all the p2m resources to Xen. */
-void p2m_teardown(struct p2m_domain *p2m);
+void p2m_teardown(struct p2m_domain *p2m, bool remove_root);
 void p2m_final_teardown(struct domain *d);
 
 /* Add a page to a domain's p2m table */
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -541,18 +541,18 @@ void hap_final_teardown(struct domain *d
         }
 
         for ( i = 0; i < MAX_ALTP2M; i++ )
-            p2m_teardown(d->arch.altp2m_p2m[i]);
+            p2m_teardown(d->arch.altp2m_p2m[i], true);
     }
 
     /* Destroy nestedp2m's first */
     for (i = 0; i < MAX_NESTEDP2M; i++) {
-        p2m_teardown(d->arch.nested_p2m[i]);
+        p2m_teardown(d->arch.nested_p2m[i], true);
     }
 
     if ( d->arch.paging.hap.total_pages != 0 )
         hap_teardown(d, NULL);
 
-    p2m_teardown(p2m_get_hostp2m(d));
+    p2m_teardown(p2m_get_hostp2m(d), true);
     /* Free any memory that the p2m teardown released */
     paging_lock(d);
     hap_set_allocation(d, 0, NULL);
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -749,11 +749,11 @@ int p2m_alloc_table(struct p2m_domain *p
  * hvm fixme: when adding support for pvh non-hardware domains, this path must
  * cleanup any foreign p2m types (release refcnts on them).
  */
-void p2m_teardown(struct p2m_domain *p2m)
+void p2m_teardown(struct p2m_domain *p2m, bool remove_root)
 /* Return all the p2m pages to Xen.
  * We know we don't have any extra mappings to these pages */
 {
-    struct page_info *pg;
+    struct page_info *pg, *root_pg = NULL;
     struct domain *d;
 
     if (p2m == NULL)
@@ -763,10 +763,22 @@ void p2m_teardown(struct p2m_domain *p2m
 
     p2m_lock(p2m);
     ASSERT(atomic_read(&d->shr_pages) == 0);
-    p2m->phys_table = pagetable_null();
+
+    if ( remove_root )
+        p2m->phys_table = pagetable_null();
+    else if ( !pagetable_is_null(p2m->phys_table) )
+    {
+        root_pg = pagetable_get_page(p2m->phys_table);
+        clear_domain_page(pagetable_get_mfn(p2m->phys_table));
+    }
 
     while ( (pg = page_list_remove_head(&p2m->pages)) )
-        d->arch.paging.free_page(d, pg);
+        if ( pg != root_pg )
+            d->arch.paging.free_page(d, pg);
+
+    if ( root_pg )
+        page_list_add(root_pg, &p2m->pages);
+
     p2m_unlock(p2m);
 }
 
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -2701,7 +2701,7 @@ int shadow_enable(struct domain *d, u32
     paging_unlock(d);
  out_unlocked:
     if ( rv != 0 && !pagetable_is_null(p2m_get_pagetable(p2m)) )
-        p2m_teardown(p2m);
+        p2m_teardown(p2m, true);
     if ( rv != 0 && pg != NULL )
     {
         pg->count_info &= ~PGC_count_mask;
@@ -2866,7 +2866,7 @@ void shadow_final_teardown(struct domain
         shadow_teardown(d, NULL);
 
     /* It is now safe to pull down the p2m map. */
-    p2m_teardown(p2m_get_hostp2m(d));
+    p2m_teardown(p2m_get_hostp2m(d), true);
     /* Free any shadow memory that the p2m teardown released */
     paging_lock(d);
     shadow_set_allocation(d, 0, NULL);
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/HAP: adjust monitor table related error handling

hap_make_monitor_table() will return INVALID_MFN if it encounters an
error condition, but hap_update_paging_modes() wasn't handling this
value, resulting in an inappropriate value being stored in
monitor_table. This would subsequently misguide at least
hap_vcpu_teardown(). Avoid this by bailing early.

Further, when a domain has/was already crashed or (perhaps less
important as there's no such path known to lead here) is already dying,
avoid calling domain_crash() on it again - that's at best confusing.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
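
The hazard being closed, in sketch form (illustrative only, not a
verbatim excerpt of the hunks below):

    mfn_t mmfn = hap_make_monitor_table(v);   /* may return INVALID_MFN */

    /*
     * Without this check, INVALID_MFN would be converted into a non-null
     * pagetable_t, so later pagetable_is_null() tests (e.g. in
     * hap_vcpu_teardown()) would treat the bogus monitor table as valid.
     */
    if ( mfn_eq(mmfn, INVALID_MFN) )
        goto unlock;

    v->arch.hvm.monitor_table = pagetable_from_mfn(mmfn);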

--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -39,6 +39,7 @@
 #include <asm/domain.h>
 #include <xen/numa.h>
 #include <asm/hvm/nestedhvm.h>
+#include <public/sched.h>
 
 #include "private.h"
 
@@ -405,8 +406,13 @@ static mfn_t hap_make_monitor_table(stru
     return m4mfn;
 
  oom:
-    printk(XENLOG_G_ERR "out of memory building monitor pagetable\n");
-    domain_crash(d);
+    if ( !d->is_dying &&
+         (!d->is_shutting_down || d->shutdown_code != SHUTDOWN_crash) )
+    {
+        printk(XENLOG_G_ERR "%pd: out of memory building monitor pagetable\n",
+               d);
+        domain_crash(d);
+    }
     return INVALID_MFN;
 }
 
@@ -766,6 +772,9 @@ static void hap_update_paging_modes(stru
     if ( pagetable_is_null(v->arch.hvm.monitor_table) )
     {
         mfn_t mmfn = hap_make_monitor_table(v);
+
+        if ( mfn_eq(mmfn, INVALID_MFN) )
+            goto unlock;
         v->arch.hvm.monitor_table = pagetable_from_mfn(mmfn);
         make_cr3(v, mmfn);
         hvm_update_host_cr3(v);
@@ -774,6 +783,7 @@ static void hap_update_paging_modes(stru
     /* CR3 is effectively updated by a mode change. Flush ASIDs, etc. */
     hap_update_cr3(v, 0, false);
 
+ unlock:
     paging_unlock(d);
     put_gfn(d, cr3_gfn);
 }
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/shadow: tolerate failure of sh_set_toplevel_shadow()

Subsequently sh_set_toplevel_shadow() will be adjusted to install a
blank entry in case prealloc fails. There are, in fact, pre-existing
error paths which would put in place a blank entry. The 4- and 2-level
code in sh_update_cr3(), however, assume the top level entry to be
valid.

Hence bail from the function in the unlikely event that it's not. Note
that 3-level logic works differently: In particular a guest is free to
supply a PDPTR pointing at 4 non-present (or otherwise deemed invalid)
entries. The guest will crash, but we already cope with that.

Really mfn_valid() is likely wrong to use in sh_set_toplevel_shadow(),
and it should instead be !mfn_eq(gmfn, INVALID_MFN). Avoid such a change
in security context, but add a respective assertion.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
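
In sketch form (mirroring the hunks below rather than adding new logic),
the 4- and 2-level paths of sh_update_cr3() now bail when the top level
slot ends up empty:

    sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l4_shadow, sh_make_shadow);

    /*
     * A null slot can only result from a failed prealloc, which in turn
     * only happens for a dying or shutting-down domain; give up quietly.
     */
    if ( unlikely(pagetable_is_null(v->arch.paging.shadow.shadow_table[0])) )
    {
        ASSERT(d->is_dying || d->is_shutting_down);
        return;
    }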

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -2516,6 +2516,7 @@ void sh_set_toplevel_shadow(struct vcpu
     /* Now figure out the new contents: is this a valid guest MFN? */
     if ( !mfn_valid(gmfn) )
     {
+        ASSERT(mfn_eq(gmfn, INVALID_MFN));
         new_entry = pagetable_null();
         goto install_new_entry;
     }
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -3312,6 +3312,11 @@ sh_update_cr3(struct vcpu *v, int do_loc
     if ( sh_remove_write_access(d, gmfn, 4, 0) != 0 )
         guest_flush_tlb_mask(d, d->dirty_cpumask);
     sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l4_shadow, sh_make_shadow);
+    if ( unlikely(pagetable_is_null(v->arch.paging.shadow.shadow_table[0])) )
+    {
+        ASSERT(d->is_dying || d->is_shutting_down);
+        return;
+    }
     if ( !shadow_mode_external(d) && !is_pv_32bit_domain(d) )
     {
         mfn_t smfn = pagetable_get_mfn(v->arch.paging.shadow.shadow_table[0]);
@@ -3370,6 +3375,11 @@ sh_update_cr3(struct vcpu *v, int do_loc
     if ( sh_remove_write_access(d, gmfn, 2, 0) != 0 )
         guest_flush_tlb_mask(d, d->dirty_cpumask);
     sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l2_shadow, sh_make_shadow);
+    if ( unlikely(pagetable_is_null(v->arch.paging.shadow.shadow_table[0])) )
+    {
+        ASSERT(d->is_dying || d->is_shutting_down);
+        return;
+    }
 #else
 #error This should never happen
 #endif
From: Roger Pau Monné <roger.pau@citrix.com>
Subject: x86/shadow: tolerate failure in shadow_prealloc()

Prevent _shadow_prealloc() from calling BUG() when unable to fulfill
the pre-allocation and instead return true/false.  Modify
shadow_prealloc() to crash the domain on allocation failure (if the
domain is not already dying), as shadow cannot operate normally after
that.  Modify callers to also gracefully handle {_,}shadow_prealloc()
failing to fulfill the request.

Note this in turn requires adjusting the callers of
sh_make_monitor_table() also to handle it returning INVALID_MFN.
sh_update_paging_modes() is also modified to add additional error
paths in case of allocation failure; some of those will return with
null monitor page tables (and the domain likely crashed).  This is no
different than the current error paths, but the newly introduced ones
are more likely to trigger.

The now added failure points in sh_update_paging_modes() also require
that on some error return paths the previous structures are cleared,
and thus the monitor table is null.

While there adjust the 'type' parameter type of shadow_prealloc() to
unsigned int rather than u32.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
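
A caller-side sketch of the new __must_check contract (illustrative
only; 'nr' is a placeholder for however many shadows the caller needs):

    if ( !shadow_prealloc(d, SH_type_l1_shadow, nr) )
    {
        /*
         * shadow_prealloc() has already crashed the domain unless it is
         * dying or shutting down; just unwind local state and bail, as
         * e.g. sh_page_fault() does below.
         */
        paging_unlock(d);
        return 0;
    }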

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -36,6 +36,7 @@
 #include <asm/flushtlb.h>
 #include <asm/shadow.h>
 #include <xen/numa.h>
+#include <public/sched.h>
 #include "private.h"
 
 DEFINE_PER_CPU(uint32_t,trace_shadow_path_flags);
@@ -928,14 +929,15 @@ static inline void trace_shadow_prealloc
 
 /* Make sure there are at least count order-sized pages
  * available in the shadow page pool. */
-static void _shadow_prealloc(struct domain *d, unsigned int pages)
+static bool __must_check _shadow_prealloc(struct domain *d, unsigned int pages)
 {
     struct vcpu *v;
     struct page_info *sp, *t;
     mfn_t smfn;
     int i;
 
-    if ( d->arch.paging.shadow.free_pages >= pages ) return;
+    if ( d->arch.paging.shadow.free_pages >= pages )
+        return true;
 
     /* Shouldn't have enabled shadows if we've no vcpus. */
     ASSERT(d->vcpu && d->vcpu[0]);
@@ -951,7 +953,8 @@ static void _shadow_prealloc(struct doma
         sh_unpin(d, smfn);
 
         /* See if that freed up enough space */
-        if ( d->arch.paging.shadow.free_pages >= pages ) return;
+        if ( d->arch.paging.shadow.free_pages >= pages )
+            return true;
     }
 
     /* Stage two: all shadow pages are in use in hierarchies that are
@@ -974,7 +977,7 @@ static void _shadow_prealloc(struct doma
                 if ( d->arch.paging.shadow.free_pages >= pages )
                 {
                     guest_flush_tlb_mask(d, d->dirty_cpumask);
-                    return;
+                    return true;
                 }
             }
         }
@@ -987,7 +990,12 @@ static void _shadow_prealloc(struct doma
            d->arch.paging.shadow.total_pages,
            d->arch.paging.shadow.free_pages,
            d->arch.paging.shadow.p2m_pages);
-    BUG();
+
+    ASSERT(d->is_dying);
+
+    guest_flush_tlb_mask(d, d->dirty_cpumask);
+
+    return false;
 }
 
 /* Make sure there are at least count pages of the order according to
@@ -995,9 +1003,19 @@ static void _shadow_prealloc(struct doma
  * This must be called before any calls to shadow_alloc().  Since this
  * will free existing shadows to make room, it must be called early enough
  * to avoid freeing shadows that the caller is currently working on. */
-void shadow_prealloc(struct domain *d, u32 type, unsigned int count)
+bool shadow_prealloc(struct domain *d, unsigned int type, unsigned int count)
 {
-    return _shadow_prealloc(d, shadow_size(type) * count);
+    bool ret = _shadow_prealloc(d, shadow_size(type) * count);
+
+    if ( !ret && !d->is_dying &&
+         (!d->is_shutting_down || d->shutdown_code != SHUTDOWN_crash) )
+        /*
+         * Failing to allocate memory required for shadow usage can only result in
+         * a domain crash, do it here rather than relying on every caller to do it.
+         */
+        domain_crash(d);
+
+    return ret;
 }
 
 /* Deliberately free all the memory we can: this will tear down all of
@@ -1218,7 +1236,7 @@ void shadow_free(struct domain *d, mfn_t
 static struct page_info *
 shadow_alloc_p2m_page(struct domain *d)
 {
-    struct page_info *pg;
+    struct page_info *pg = NULL;
 
     /* This is called both from the p2m code (which never holds the
      * paging lock) and the log-dirty code (which always does). */
@@ -1236,16 +1254,18 @@ shadow_alloc_p2m_page(struct domain *d)
                     d->arch.paging.shadow.p2m_pages,
                     shadow_min_acceptable_pages(d));
         }
-        paging_unlock(d);
-        return NULL;
+        goto out;
     }
 
-    shadow_prealloc(d, SH_type_p2m_table, 1);
+    if ( !shadow_prealloc(d, SH_type_p2m_table, 1) )
+        goto out;
+
     pg = mfn_to_page(shadow_alloc(d, SH_type_p2m_table, 0));
     d->arch.paging.shadow.p2m_pages++;
     d->arch.paging.shadow.total_pages--;
     ASSERT(!page_get_owner(pg) && !(pg->count_info & PGC_count_mask));
 
+ out:
     paging_unlock(d);
 
     return pg;
@@ -1336,7 +1356,9 @@ int shadow_set_allocation(struct domain
         else if ( d->arch.paging.shadow.total_pages > pages )
         {
             /* Need to return memory to domheap */
-            _shadow_prealloc(d, 1);
+            if ( !_shadow_prealloc(d, 1) )
+                return -ENOMEM;
+
             sp = page_list_remove_head(&d->arch.paging.shadow.freelist);
             ASSERT(sp);
             /*
@@ -2334,12 +2356,13 @@ static void sh_update_paging_modes(struc
     if ( mfn_eq(v->arch.paging.shadow.oos_snapshot[0], INVALID_MFN) )
     {
         int i;
+
+        if ( !shadow_prealloc(d, SH_type_oos_snapshot, SHADOW_OOS_PAGES) )
+            return;
+
         for(i = 0; i < SHADOW_OOS_PAGES; i++)
-        {
-            shadow_prealloc(d, SH_type_oos_snapshot, 1);
             v->arch.paging.shadow.oos_snapshot[i] =
                 shadow_alloc(d, SH_type_oos_snapshot, 0);
-        }
     }
 #endif /* OOS */
 
@@ -2403,6 +2426,9 @@ static void sh_update_paging_modes(struc
             mfn_t mmfn = sh_make_monitor_table(
                              v, v->arch.paging.mode->shadow.shadow_levels);
 
+            if ( mfn_eq(mmfn, INVALID_MFN) )
+                return;
+
             v->arch.hvm.monitor_table = pagetable_from_mfn(mmfn);
             make_cr3(v, mmfn);
             hvm_update_host_cr3(v);
@@ -2441,6 +2467,12 @@ static void sh_update_paging_modes(struc
                 v->arch.hvm.monitor_table = pagetable_null();
                 new_mfn = sh_make_monitor_table(
                               v, v->arch.paging.mode->shadow.shadow_levels);
+                if ( mfn_eq(new_mfn, INVALID_MFN) )
+                {
+                    sh_destroy_monitor_table(v, old_mfn,
+                                             old_mode->shadow.shadow_levels);
+                    return;
+                }
                 v->arch.hvm.monitor_table = pagetable_from_mfn(new_mfn);
                 SHADOW_PRINTK("new monitor table %"PRI_mfn "\n",
                                mfn_x(new_mfn));
@@ -2526,7 +2558,12 @@ void sh_set_toplevel_shadow(struct vcpu
     if ( !mfn_valid(smfn) )
     {
         /* Make sure there's enough free shadow memory. */
-        shadow_prealloc(d, root_type, 1);
+        if ( !shadow_prealloc(d, root_type, 1) )
+        {
+            new_entry = pagetable_null();
+            goto install_new_entry;
+        }
+
         /* Shadow the page. */
         smfn = make_shadow(v, gmfn, root_type);
     }
--- a/xen/arch/x86/mm/shadow/hvm.c
+++ b/xen/arch/x86/mm/shadow/hvm.c
@@ -700,7 +700,9 @@ mfn_t sh_make_monitor_table(const struct
     ASSERT(!pagetable_get_pfn(v->arch.hvm.monitor_table));
 
     /* Guarantee we can get the memory we need */
-    shadow_prealloc(d, SH_type_monitor_table, CONFIG_PAGING_LEVELS);
+    if ( !shadow_prealloc(d, SH_type_monitor_table, CONFIG_PAGING_LEVELS) )
+        return INVALID_MFN;
+
     m4mfn = shadow_alloc(d, SH_type_monitor_table, 0);
     mfn_to_page(m4mfn)->shadow_flags = 4;
 
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -2440,9 +2440,14 @@ static int sh_page_fault(struct vcpu *v,
      * Preallocate shadow pages *before* removing writable accesses
      * otherwhise an OOS L1 might be demoted and promoted again with
      * writable mappings. */
-    shadow_prealloc(d,
-                    SH_type_l1_shadow,
-                    GUEST_PAGING_LEVELS < 4 ? 1 : GUEST_PAGING_LEVELS - 1);
+    if ( !shadow_prealloc(d, SH_type_l1_shadow,
+                          GUEST_PAGING_LEVELS < 4
+                          ? 1 : GUEST_PAGING_LEVELS - 1) )
+    {
+        paging_unlock(d);
+        put_gfn(d, gfn_x(gfn));
+        return 0;
+    }
 
     rc = gw_remove_write_accesses(v, va, &gw);
 
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -383,7 +383,8 @@ void shadow_promote(struct domain *d, mf
 void shadow_demote(struct domain *d, mfn_t gmfn, u32 type);
 
 /* Shadow page allocation functions */
-void  shadow_prealloc(struct domain *d, u32 shadow_type, unsigned int count);
+bool __must_check shadow_prealloc(struct domain *d, unsigned int shadow_type,
+                                  unsigned int count);
 mfn_t shadow_alloc(struct domain *d,
                     u32 shadow_type,
                     unsigned long backpointer);
From: Roger Pau Monné <roger.pau@citrix.com>
Subject: x86/p2m: refuse new allocations for dying domains

This will in particular prevent any attempts to add entries to the p2m,
once - in a subsequent change - non-root entries have been removed.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
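
The change boils down to an early-exit guard at the pool allocation
entry points; in sketch form (pool_alloc() is a made-up name standing
in for hap_alloc() / shadow_alloc_p2m_page()):

    static struct page_info *pool_alloc(struct domain *d)
    {
        /* No new p2m/paging pool pages once the domain is dying. */
        if ( unlikely(d->is_dying) )
            return NULL;

        /* ... normal freelist allocation follows in the real functions ... */
        return page_list_remove_head(&d->arch.paging.hap.freelist);
    }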

--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -245,6 +245,9 @@ static struct page_info *hap_alloc(struc
 
     ASSERT(paging_locked_by_me(d));
 
+    if ( unlikely(d->is_dying) )
+        return NULL;
+
     pg = page_list_remove_head(&d->arch.paging.hap.freelist);
     if ( unlikely(!pg) )
         return NULL;
@@ -281,7 +284,7 @@ static struct page_info *hap_alloc_p2m_p
         d->arch.paging.hap.p2m_pages++;
         ASSERT(!page_get_owner(pg) && !(pg->count_info & PGC_count_mask));
     }
-    else if ( !d->arch.paging.p2m_alloc_failed )
+    else if ( !d->arch.paging.p2m_alloc_failed && !d->is_dying )
     {
         d->arch.paging.p2m_alloc_failed = 1;
         dprintk(XENLOG_ERR, "d%i failed to allocate from HAP pool\n",
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -939,6 +939,10 @@ static bool __must_check _shadow_preallo
     if ( d->arch.paging.shadow.free_pages >= pages )
         return true;
 
+    if ( unlikely(d->is_dying) )
+        /* No reclaim when the domain is dying, teardown will take care of it. */
+        return false;
+
     /* Shouldn't have enabled shadows if we've no vcpus. */
     ASSERT(d->vcpu && d->vcpu[0]);
 
@@ -991,7 +995,7 @@ static bool __must_check _shadow_preallo
            d->arch.paging.shadow.free_pages,
            d->arch.paging.shadow.p2m_pages);
 
-    ASSERT(d->is_dying);
+    ASSERT_UNREACHABLE();
 
     guest_flush_tlb_mask(d, d->dirty_cpumask);
 
@@ -1005,10 +1009,13 @@ static bool __must_check _shadow_preallo
  * to avoid freeing shadows that the caller is currently working on. */
 bool shadow_prealloc(struct domain *d, unsigned int type, unsigned int count)
 {
-    bool ret = _shadow_prealloc(d, shadow_size(type) * count);
+    bool ret;
+
+    if ( unlikely(d->is_dying) )
+       return false;
 
-    if ( !ret && !d->is_dying &&
-         (!d->is_shutting_down || d->shutdown_code != SHUTDOWN_crash) )
+    ret = _shadow_prealloc(d, shadow_size(type) * count);
+    if ( !ret && (!d->is_shutting_down || d->shutdown_code != SHUTDOWN_crash) )
         /*
          * Failing to allocate memory required for shadow usage can only result in
          * a domain crash, do it here rather than relying on every caller to do it.
@@ -1238,6 +1245,9 @@ shadow_alloc_p2m_page(struct domain *d)
 {
     struct page_info *pg = NULL;
 
+    if ( unlikely(d->is_dying) )
+       return NULL;
+
     /* This is called both from the p2m code (which never holds the
      * paging lock) and the log-dirty code (which always does). */
     paging_lock_recursive(d);
From: Roger Pau Monné <roger.pau@citrix.com>
Subject: x86/p2m: truly free paging pool memory for dying domains

Modify {hap,shadow}_free to free the page immediately if the domain is
dying, so that pages don't accumulate in the pool when
{shadow,hap}_final_teardown() get called. This is to limit the amount of
work which needs to be done there (in a non-preemptable manner).

Note the call to shadow_free() in shadow_free_p2m_page() is moved after
increasing total_pages, so that the decrease done in shadow_free() in
case the domain is dying doesn't underflow the counter, even if just for
a short interval.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
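
The ordering note above matters because shadow_free() now decrements
total_pages itself when the domain is dying; a sketch of the fixed
sequence in shadow_free_p2m_page() (illustrative only):

    d->arch.paging.shadow.p2m_pages--;
    d->arch.paging.shadow.total_pages++;   /* page is accounted back to the pool... */
    shadow_free(d, page_to_mfn(pg));       /* ...and freed right away for a dying
                                            * domain, with total_pages-- done
                                            * inside, so the counter never dips
                                            * below zero. */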

--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -265,6 +265,18 @@ static void hap_free(struct domain *d, m
 
     ASSERT(paging_locked_by_me(d));
 
+    /*
+     * For dying domains, actually free the memory here. This way less work is
+     * left to hap_final_teardown(), which cannot easily have preemption checks
+     * added.
+     */
+    if ( unlikely(d->is_dying) )
+    {
+        free_domheap_page(pg);
+        d->arch.paging.hap.total_pages--;
+        return;
+    }
+
     d->arch.paging.hap.free_pages++;
     page_list_add_tail(pg, &d->arch.paging.hap.freelist);
 }
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1187,6 +1187,7 @@ mfn_t shadow_alloc(struct domain *d,
 void shadow_free(struct domain *d, mfn_t smfn)
 {
     struct page_info *next = NULL, *sp = mfn_to_page(smfn);
+    bool dying = ACCESS_ONCE(d->is_dying);
     struct page_list_head *pin_list;
     unsigned int pages;
     u32 shadow_type;
@@ -1229,11 +1230,32 @@ void shadow_free(struct domain *d, mfn_t
          * just before the allocator hands the page out again. */
         page_set_tlbflush_timestamp(sp);
         perfc_decr(shadow_alloc_count);
-        page_list_add_tail(sp, &d->arch.paging.shadow.freelist);
+
+        /*
+         * For dying domains, actually free the memory here. This way less
+         * work is left to shadow_final_teardown(), which cannot easily have
+         * preemption checks added.
+         */
+        if ( unlikely(dying) )
+        {
+            /*
+             * The backpointer field (sh.back) used by shadow code aliases the
+             * domain owner field, unconditionally clear it here to avoid
+             * free_domheap_page() attempting to parse it.
+             */
+            page_set_owner(sp, NULL);
+            free_domheap_page(sp);
+        }
+        else
+            page_list_add_tail(sp, &d->arch.paging.shadow.freelist);
+
         sp = next;
     }
 
-    d->arch.paging.shadow.free_pages += pages;
+    if ( unlikely(dying) )
+        d->arch.paging.shadow.total_pages -= pages;
+    else
+        d->arch.paging.shadow.free_pages += pages;
 }
 
 /* Divert a page from the pool to be used by the p2m mapping.
@@ -1303,9 +1325,9 @@ shadow_free_p2m_page(struct domain *d, s
      * paging lock) and the log-dirty code (which always does). */
     paging_lock_recursive(d);
 
-    shadow_free(d, page_to_mfn(pg));
     d->arch.paging.shadow.p2m_pages--;
     d->arch.paging.shadow.total_pages++;
+    shadow_free(d, page_to_mfn(pg));
 
     paging_unlock(d);
 }
From: Roger Pau Monné <roger.pau@citrix.com>
Subject: x86/p2m: free the paging memory pool preemptively

The paging memory pool is currently freed in two different places:
from {shadow,hap}_teardown() via domain_relinquish_resources() and
from {shadow,hap}_final_teardown() via complete_domain_destroy().
While the former does handle preemption, the latter doesn't.

Attempt to move as much p2m related freeing as possible to happen
before the call to {shadow,hap}_teardown(), so that most memory can be
freed in a preemptive way.  In order to avoid causing issues to
existing callers, leave the root p2m page tables set and free them in
{hap,shadow}_final_teardown().  Also modify {hap,shadow}_free to free
the page immediately if the domain is dying, so that pages don't
accumulate in the pool when {shadow,hap}_final_teardown() get called.

Move altp2m_vcpu_disable_ve() to be done in hap_teardown(), as that's
the place where altp2m_active gets disabled now.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
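
The resulting split, as a sketch of the control flow (illustrative, not
a verbatim excerpt of the hunks below):

    /*
     * domain_relinquish_resources() -> {hap,shadow}_teardown():
     * preemptible; frees the bulk of the pool but leaves the root
     * page tables in place.
     */
    p2m_teardown(p2m_get_hostp2m(d), /* remove_root */ false);

    /*
     * complete_domain_destroy() -> {hap,shadow}_final_teardown():
     * not preemptible; by now only the root page tables are left.
     */
    p2m_teardown(p2m_get_hostp2m(d), /* remove_root */ true);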

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -38,7 +38,6 @@
 #include <xen/livepatch.h>
 #include <public/sysctl.h>
 #include <public/hvm/hvm_vcpu.h>
-#include <asm/altp2m.h>
 #include <asm/regs.h>
 #include <asm/mc146818rtc.h>
 #include <asm/system.h>
@@ -2381,12 +2380,6 @@ int domain_relinquish_resources(struct d
             vpmu_destroy(v);
         }
 
-        if ( altp2m_active(d) )
-        {
-            for_each_vcpu ( d, v )
-                altp2m_vcpu_disable_ve(v);
-        }
-
         if ( is_pv_domain(d) )
         {
             for_each_vcpu ( d, v )
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -28,6 +28,7 @@
 #include <xen/domain_page.h>
 #include <xen/guest_access.h>
 #include <xen/keyhandler.h>
+#include <asm/altp2m.h>
 #include <asm/event.h>
 #include <asm/page.h>
 #include <asm/current.h>
@@ -546,24 +547,8 @@ void hap_final_teardown(struct domain *d
     unsigned int i;
 
     if ( hvm_altp2m_supported() )
-    {
-        d->arch.altp2m_active = 0;
-
-        if ( d->arch.altp2m_eptp )
-        {
-            free_xenheap_page(d->arch.altp2m_eptp);
-            d->arch.altp2m_eptp = NULL;
-        }
-
-        if ( d->arch.altp2m_visible_eptp )
-        {
-            free_xenheap_page(d->arch.altp2m_visible_eptp);
-            d->arch.altp2m_visible_eptp = NULL;
-        }
-
         for ( i = 0; i < MAX_ALTP2M; i++ )
             p2m_teardown(d->arch.altp2m_p2m[i], true);
-    }
 
     /* Destroy nestedp2m's first */
     for (i = 0; i < MAX_NESTEDP2M; i++) {
@@ -578,6 +563,8 @@ void hap_final_teardown(struct domain *d
     paging_lock(d);
     hap_set_allocation(d, 0, NULL);
     ASSERT(d->arch.paging.hap.p2m_pages == 0);
+    ASSERT(d->arch.paging.hap.free_pages == 0);
+    ASSERT(d->arch.paging.hap.total_pages == 0);
     paging_unlock(d);
 }
 
@@ -603,6 +590,7 @@ void hap_vcpu_teardown(struct vcpu *v)
 void hap_teardown(struct domain *d, bool *preempted)
 {
     struct vcpu *v;
+    unsigned int i;
 
     ASSERT(d->is_dying);
     ASSERT(d != current->domain);
@@ -611,6 +599,28 @@ void hap_teardown(struct domain *d, bool
     for_each_vcpu ( d, v )
         hap_vcpu_teardown(v);
 
+    /* Leave the root pt in case we get further attempts to modify the p2m. */
+    if ( hvm_altp2m_supported() )
+    {
+        if ( altp2m_active(d) )
+            for_each_vcpu ( d, v )
+                altp2m_vcpu_disable_ve(v);
+
+        d->arch.altp2m_active = 0;
+
+        FREE_XENHEAP_PAGE(d->arch.altp2m_eptp);
+        FREE_XENHEAP_PAGE(d->arch.altp2m_visible_eptp);
+
+        for ( i = 0; i < MAX_ALTP2M; i++ )
+            p2m_teardown(d->arch.altp2m_p2m[i], false);
+    }
+
+    /* Destroy nestedp2m's after altp2m. */
+    for ( i = 0; i < MAX_NESTEDP2M; i++ )
+        p2m_teardown(d->arch.nested_p2m[i], false);
+
+    p2m_teardown(p2m_get_hostp2m(d), false);
+
     paging_lock(d); /* Keep various asserts happy */
 
     if ( d->arch.paging.hap.total_pages != 0 )
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -2824,8 +2824,17 @@ void shadow_teardown(struct domain *d, b
     for_each_vcpu ( d, v )
         shadow_vcpu_teardown(v);
 
+    p2m_teardown(p2m_get_hostp2m(d), false);
+
     paging_lock(d);
 
+    /*
+     * Reclaim all shadow memory so that shadow_set_allocation() doesn't find
+     * in-use pages, as _shadow_prealloc() will no longer try to reclaim pages
+     * because the domain is dying.
+     */
+    shadow_blow_tables(d);
+
 #if (SHADOW_OPTIMIZATIONS & (SHOPT_VIRTUAL_TLB|SHOPT_OUT_OF_SYNC))
     /* Free the virtual-TLB array attached to each vcpu */
     for_each_vcpu(d, v)
@@ -2946,6 +2955,9 @@ void shadow_final_teardown(struct domain
                    d->arch.paging.shadow.total_pages,
                    d->arch.paging.shadow.free_pages,
                    d->arch.paging.shadow.p2m_pages);
+    ASSERT(!d->arch.paging.shadow.total_pages);
+    ASSERT(!d->arch.paging.shadow.free_pages);
+    ASSERT(!d->arch.paging.shadow.p2m_pages);
     paging_unlock(d);
 }
 
From: Julien Grall <jgrall@amazon.com>
Subject: xen/x86: p2m: Add preemption in p2m_teardown()

The list p2m->pages contains all the pages used by the P2M. On a large
instance this list can be quite long, and freeing it via
d->arch.paging.free_page() takes more than 1ms for an 80GB guest
on a Xen running in a nested environment on a c5.metal.

By extrapolation, it would take > 100ms for an 8TB guest (which we
currently security support). So add some preemption in p2m_teardown()
and propagate it to the callers. Note there are 3 places where
the preemption is not enabled:
    - hap_final_teardown()/shadow_final_teardown(): We prevent
      updates to the P2M once the domain is dying (so no more
      pages can be allocated) and most of the P2M pages will be
      freed in a preemptible manner when relinquishing the
      resources. So it is fine to disable preemption here.
    - shadow_enable(): This is fine because it will undo the allocation
      that may have been made by p2m_alloc_table() (so only the root
      page table).

The preemption is arbitrarily checked every 1024 iterations.

Note that with the current approach, Xen doesn't keep track of whether
the alt/nested P2Ms have been cleared. So there is some redundant work.
However, this is not expected to incur too much overhead (the P2M lock
shouldn't be contended during teardown). So this optimization is
left outside of the security event.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
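
The preemption pattern added to p2m_teardown(), in sketch form
(illustrative only; variable declarations elided):

    unsigned int i = 0;

    while ( (pg = page_list_remove_head(&p2m->pages)) )
    {
        d->arch.paging.free_page(d, pg);

        /* Check for preemption every 1024 pages freed. */
        if ( preempted && !(++i % 1024) && general_preempt_check() )
        {
            *preempted = true;   /* the caller arranges a continuation */
            break;
        }
    }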

--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -574,7 +574,7 @@ int p2m_init(struct domain *d);
 int p2m_alloc_table(struct p2m_domain *p2m);
 
 /* Return all the p2m resources to Xen. */
-void p2m_teardown(struct p2m_domain *p2m, bool remove_root);
+void p2m_teardown(struct p2m_domain *p2m, bool remove_root, bool *preempted);
 void p2m_final_teardown(struct domain *d);
 
 /* Add a page to a domain's p2m table */
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -548,17 +548,17 @@ void hap_final_teardown(struct domain *d
 
     if ( hvm_altp2m_supported() )
         for ( i = 0; i < MAX_ALTP2M; i++ )
-            p2m_teardown(d->arch.altp2m_p2m[i], true);
+            p2m_teardown(d->arch.altp2m_p2m[i], true, NULL);
 
     /* Destroy nestedp2m's first */
     for (i = 0; i < MAX_NESTEDP2M; i++) {
-        p2m_teardown(d->arch.nested_p2m[i], true);
+        p2m_teardown(d->arch.nested_p2m[i], true, NULL);
     }
 
     if ( d->arch.paging.hap.total_pages != 0 )
         hap_teardown(d, NULL);
 
-    p2m_teardown(p2m_get_hostp2m(d), true);
+    p2m_teardown(p2m_get_hostp2m(d), true, NULL);
     /* Free any memory that the p2m teardown released */
     paging_lock(d);
     hap_set_allocation(d, 0, NULL);
@@ -612,14 +612,24 @@ void hap_teardown(struct domain *d, bool
         FREE_XENHEAP_PAGE(d->arch.altp2m_visible_eptp);
 
         for ( i = 0; i < MAX_ALTP2M; i++ )
-            p2m_teardown(d->arch.altp2m_p2m[i], false);
+        {
+            p2m_teardown(d->arch.altp2m_p2m[i], false, preempted);
+            if ( preempted && *preempted )
+                return;
+        }
     }
 
     /* Destroy nestedp2m's after altp2m. */
     for ( i = 0; i < MAX_NESTEDP2M; i++ )
-        p2m_teardown(d->arch.nested_p2m[i], false);
+    {
+        p2m_teardown(d->arch.nested_p2m[i], false, preempted);
+        if ( preempted && *preempted )
+            return;
+    }
 
-    p2m_teardown(p2m_get_hostp2m(d), false);
+    p2m_teardown(p2m_get_hostp2m(d), false, preempted);
+    if ( preempted && *preempted )
+        return;
 
     paging_lock(d); /* Keep various asserts happy */
 
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -749,12 +749,13 @@ int p2m_alloc_table(struct p2m_domain *p
  * hvm fixme: when adding support for pvh non-hardware domains, this path must
  * cleanup any foreign p2m types (release refcnts on them).
  */
-void p2m_teardown(struct p2m_domain *p2m, bool remove_root)
+void p2m_teardown(struct p2m_domain *p2m, bool remove_root, bool *preempted)
 /* Return all the p2m pages to Xen.
  * We know we don't have any extra mappings to these pages */
 {
     struct page_info *pg, *root_pg = NULL;
     struct domain *d;
+    unsigned int i = 0;
 
     if (p2m == NULL)
         return;
@@ -773,8 +774,19 @@ void p2m_teardown(struct p2m_domain *p2m
     }
 
     while ( (pg = page_list_remove_head(&p2m->pages)) )
-        if ( pg != root_pg )
-            d->arch.paging.free_page(d, pg);
+    {
+        if ( pg == root_pg )
+            continue;
+
+        d->arch.paging.free_page(d, pg);
+
+        /* Arbitrarily check preemption every 1024 iterations */
+        if ( preempted && !(++i % 1024) && general_preempt_check() )
+        {
+            *preempted = true;
+            break;
+        }
+    }
 
     if ( root_pg )
         page_list_add(root_pg, &p2m->pages);
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -2770,8 +2770,12 @@ int shadow_enable(struct domain *d, u32
  out_locked:
     paging_unlock(d);
  out_unlocked:
+    /*
+     * It is fine to ignore preemption here because only the root
+     * will be allocated by p2m_alloc_table().
+     */
     if ( rv != 0 && !pagetable_is_null(p2m_get_pagetable(p2m)) )
-        p2m_teardown(p2m, true);
+        p2m_teardown(p2m, true, NULL);
     if ( rv != 0 && pg != NULL )
     {
         pg->count_info &= ~PGC_count_mask;
@@ -2824,7 +2828,9 @@ void shadow_teardown(struct domain *d, b
     for_each_vcpu ( d, v )
         shadow_vcpu_teardown(v);
 
-    p2m_teardown(p2m_get_hostp2m(d), false);
+    p2m_teardown(p2m_get_hostp2m(d), false, preempted);
+    if ( preempted && *preempted )
+        return;
 
     paging_lock(d);
 
@@ -2945,7 +2951,7 @@ void shadow_final_teardown(struct domain
         shadow_teardown(d, NULL);
 
     /* It is now safe to pull down the p2m map. */
-    p2m_teardown(p2m_get_hostp2m(d), true);
+    p2m_teardown(p2m_get_hostp2m(d), true, NULL);
     /* Free any shadow memory that the p2m teardown released */
     paging_lock(d);
     shadow_set_allocation(d, 0, NULL);
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/HAP: adjust monitor table related error handling

hap_make_monitor_table() will return INVALID_MFN if it encounters an
error condition, but hap_update_paging_modes() wasn't handling this
value, resulting in an inappropriate value being stored in
monitor_table. This would subsequently misguide at least
hap_vcpu_teardown(). Avoid this by bailing early.

Further, when a domain has/was already crashed or (perhaps less
important as there's no such path known to lead here) is already dying,
avoid calling domain_crash() on it again - that's at best confusing.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
---
v11: Also check for "crashed" status. Re-write / extend description.
v2: New.

--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -39,6 +39,7 @@
 #include <asm/domain.h>
 #include <xen/numa.h>
 #include <asm/hvm/nestedhvm.h>
+#include <public/sched.h>
 
 #include "private.h"
 
@@ -405,8 +406,13 @@ static mfn_t hap_make_monitor_table(stru
     return m4mfn;
 
  oom:
-    printk(XENLOG_G_ERR "out of memory building monitor pagetable\n");
-    domain_crash(d);
+    if ( !d->is_dying &&
+         (!d->is_shutting_down || d->shutdown_code != SHUTDOWN_crash) )
+    {
+        printk(XENLOG_G_ERR "%pd: out of memory building monitor pagetable\n",
+               d);
+        domain_crash(d);
+    }
     return INVALID_MFN;
 }
 
@@ -763,6 +769,9 @@ static void cf_check hap_update_paging_m
     if ( pagetable_is_null(v->arch.hvm.monitor_table) )
     {
         mfn_t mmfn = hap_make_monitor_table(v);
+
+        if ( mfn_eq(mmfn, INVALID_MFN) )
+            goto unlock;
         v->arch.hvm.monitor_table = pagetable_from_mfn(mmfn);
         make_cr3(v, mmfn);
         hvm_update_host_cr3(v);
@@ -771,6 +780,7 @@ static void cf_check hap_update_paging_m
     /* CR3 is effectively updated by a mode change. Flush ASIDs, etc. */
     hap_update_cr3(v, 0, false);
 
+ unlock:
     paging_unlock(d);
     put_gfn(d, cr3_gfn);
 }
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/shadow: tolerate failure of sh_set_toplevel_shadow()

Subsequently sh_set_toplevel_shadow() will be adjusted to install a
blank entry in case prealloc fails. There are, in fact, pre-existing
error paths which would put in place a blank entry. The 4- and 2-level
code in sh_update_cr3(), however, assume the top level entry to be
valid.

Hence bail from the function in the unlikely event that it's not. Note
that 3-level logic works differently: In particular a guest is free to
supply a PDPTR pointing at 4 non-present (or otherwise deemed invalid)
entries. The guest will crash, but we already cope with that.

Really mfn_valid() is likely wrong to use in sh_set_toplevel_shadow(),
and it should instead be !mfn_eq(gmfn, INVALID_MFN). Avoid such a change
in security context, but add a respective assertion.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v11:
 - Justify ASSERT() in sh_set_toplevel_shadow().

Changes since v9:
 - Add assertions. Commit message adjustments.

Changes since v7:
 - New in this version.

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -2521,6 +2521,7 @@ void sh_set_toplevel_shadow(struct vcpu
     /* Now figure out the new contents: is this a valid guest MFN? */
     if ( !mfn_valid(gmfn) )
     {
+        ASSERT(mfn_eq(gmfn, INVALID_MFN));
         new_entry = pagetable_null();
         goto install_new_entry;
     }
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -3316,6 +3316,11 @@ static void cf_check sh_update_cr3(struc
     if ( sh_remove_write_access(d, gmfn, 4, 0) != 0 )
         guest_flush_tlb_mask(d, d->dirty_cpumask);
     sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l4_shadow, sh_make_shadow);
+    if ( unlikely(pagetable_is_null(v->arch.paging.shadow.shadow_table[0])) )
+    {
+        ASSERT(d->is_dying || d->is_shutting_down);
+        return;
+    }
     if ( !shadow_mode_external(d) && !is_pv_32bit_domain(d) )
     {
         mfn_t smfn = pagetable_get_mfn(v->arch.paging.shadow.shadow_table[0]);
@@ -3372,6 +3377,11 @@ static void cf_check sh_update_cr3(struc
     if ( sh_remove_write_access(d, gmfn, 2, 0) != 0 )
         guest_flush_tlb_mask(d, d->dirty_cpumask);
     sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l2_shadow, sh_make_shadow);
+    if ( unlikely(pagetable_is_null(v->arch.paging.shadow.shadow_table[0])) )
+    {
+        ASSERT(d->is_dying || d->is_shutting_down);
+        return;
+    }
 #else
 #error This should never happen
 #endif
From: Roger Pau Monné <roger.pau@citrix.com>
Subject: x86/shadow: tolerate failure in shadow_prealloc()

Prevent _shadow_prealloc() from calling BUG() when unable to fulfill
the pre-allocation and instead return true/false.  Modify
shadow_prealloc() to crash the domain on allocation failure (if the
domain is not already dying), as shadow cannot operate normally after
that.  Modify callers to also gracefully handle {_,}shadow_prealloc()
failing to fulfill the request.

Note this in turn requires adjusting the callers of
sh_make_monitor_table() also to handle it returning INVALID_MFN.
sh_update_paging_modes() is also modified to add additional error
paths in case of allocation failure; some of those will return with
null monitor page tables (and the domain likely crashed).  This is no
different than the current error paths, but the newly introduced ones
are more likely to trigger.

The now added failure points in sh_update_paging_modes() also require
that on some error return paths the previous structures are cleared,
and thus the monitor table is null.

While there adjust the 'type' parameter type of shadow_prealloc() to
unsigned int rather than u32.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
---
With Andrew pointing out that ->is_shutting_down can be reset again, I
wonder whether resuming a crashed domain wouldn't better be refused, if
for nothing else than its Xen-internal state potentially not allowing
further execution. Of course the question then is whether soft-reset
tidies up state enough for a guest to become resumable.
---
Changes since v9:
 - INVALID_MFN rather than "invalid mfn" in description.

Changes since v7:
 - Refine conditional in shadow_prealloc().
 - Add TLB flush and assertion to _shadow_prealloc().
 - Re-wrap long line.

Changes since v6:
 - Use d->is_dying instead of is_dying_domain().

Changes since v5:
 - Remove __must_check from definition.

Changes since v4:
 - Add __must_check attributes to {_,}shadow_prealloc().
 - Pre-allocate all OOS pages outside of the allocation loop, so that
   we don't return to the caller with a partially populated OOS array.
 - Add an out label to shadow_alloc_p2m_page().
 - Crash the domain in shadow_prealloc() rather than
   _shadow_prealloc(), so that the call in shadow_set_allocation()
   avoids the domain_crash on failure.
 - Do not crash the domain if already dying.

Changes since v3:
 - New in this version.

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -36,6 +36,7 @@
 #include <asm/flushtlb.h>
 #include <asm/shadow.h>
 #include <xen/numa.h>
+#include <public/sched.h>
 #include "private.h"
 
 DEFINE_PER_CPU(uint32_t,trace_shadow_path_flags);
@@ -928,14 +929,15 @@ static inline void trace_shadow_prealloc
 
 /* Make sure there are at least count order-sized pages
  * available in the shadow page pool. */
-static void _shadow_prealloc(struct domain *d, unsigned int pages)
+static bool __must_check _shadow_prealloc(struct domain *d, unsigned int pages)
 {
     struct vcpu *v;
     struct page_info *sp, *t;
     mfn_t smfn;
     int i;
 
-    if ( d->arch.paging.shadow.free_pages >= pages ) return;
+    if ( d->arch.paging.shadow.free_pages >= pages )
+        return true;
 
     /* Shouldn't have enabled shadows if we've no vcpus. */
     ASSERT(d->vcpu && d->vcpu[0]);
@@ -951,7 +953,8 @@ static void _shadow_prealloc(struct doma
         sh_unpin(d, smfn);
 
         /* See if that freed up enough space */
-        if ( d->arch.paging.shadow.free_pages >= pages ) return;
+        if ( d->arch.paging.shadow.free_pages >= pages )
+            return true;
     }
 
     /* Stage two: all shadow pages are in use in hierarchies that are
@@ -974,7 +977,7 @@ static void _shadow_prealloc(struct doma
                 if ( d->arch.paging.shadow.free_pages >= pages )
                 {
                     guest_flush_tlb_mask(d, d->dirty_cpumask);
-                    return;
+                    return true;
                 }
             }
         }
@@ -987,7 +990,12 @@ static void _shadow_prealloc(struct doma
            d->arch.paging.shadow.total_pages,
            d->arch.paging.shadow.free_pages,
            d->arch.paging.shadow.p2m_pages);
-    BUG();
+
+    ASSERT(d->is_dying);
+
+    guest_flush_tlb_mask(d, d->dirty_cpumask);
+
+    return false;
 }
 
 /* Make sure there are at least count pages of the order according to
@@ -995,9 +1003,19 @@ static void _shadow_prealloc(struct doma
  * This must be called before any calls to shadow_alloc().  Since this
  * will free existing shadows to make room, it must be called early enough
  * to avoid freeing shadows that the caller is currently working on. */
-void shadow_prealloc(struct domain *d, u32 type, unsigned int count)
+bool shadow_prealloc(struct domain *d, unsigned int type, unsigned int count)
 {
-    return _shadow_prealloc(d, shadow_size(type) * count);
+    bool ret = _shadow_prealloc(d, shadow_size(type) * count);
+
+    if ( !ret && !d->is_dying &&
+         (!d->is_shutting_down || d->shutdown_code != SHUTDOWN_crash) )
+        /*
+         * Failing to allocate memory required for shadow usage can only result in
+         * a domain crash, do it here rather than relying on every caller to do it.
+         */
+        domain_crash(d);
+
+    return ret;
 }
 
 /* Deliberately free all the memory we can: this will tear down all of
@@ -1218,7 +1236,7 @@ void shadow_free(struct domain *d, mfn_t
 static struct page_info *cf_check
 shadow_alloc_p2m_page(struct domain *d)
 {
-    struct page_info *pg;
+    struct page_info *pg = NULL;
 
     /* This is called both from the p2m code (which never holds the
      * paging lock) and the log-dirty code (which always does). */
@@ -1236,16 +1254,18 @@ shadow_alloc_p2m_page(struct domain *d)
                     d->arch.paging.shadow.p2m_pages,
                     shadow_min_acceptable_pages(d));
         }
-        paging_unlock(d);
-        return NULL;
+        goto out;
     }
 
-    shadow_prealloc(d, SH_type_p2m_table, 1);
+    if ( !shadow_prealloc(d, SH_type_p2m_table, 1) )
+        goto out;
+
     pg = mfn_to_page(shadow_alloc(d, SH_type_p2m_table, 0));
     d->arch.paging.shadow.p2m_pages++;
     d->arch.paging.shadow.total_pages--;
     ASSERT(!page_get_owner(pg) && !(pg->count_info & PGC_count_mask));
 
+ out:
     paging_unlock(d);
 
     return pg;
@@ -1336,7 +1356,9 @@ int shadow_set_allocation(struct domain
         else if ( d->arch.paging.shadow.total_pages > pages )
         {
             /* Need to return memory to domheap */
-            _shadow_prealloc(d, 1);
+            if ( !_shadow_prealloc(d, 1) )
+                return -ENOMEM;
+
             sp = page_list_remove_head(&d->arch.paging.shadow.freelist);
             ASSERT(sp);
             /*
@@ -2339,12 +2361,13 @@ static void sh_update_paging_modes(struc
     if ( mfn_eq(v->arch.paging.shadow.oos_snapshot[0], INVALID_MFN) )
     {
         int i;
+
+        if ( !shadow_prealloc(d, SH_type_oos_snapshot, SHADOW_OOS_PAGES) )
+            return;
+
         for(i = 0; i < SHADOW_OOS_PAGES; i++)
-        {
-            shadow_prealloc(d, SH_type_oos_snapshot, 1);
             v->arch.paging.shadow.oos_snapshot[i] =
                 shadow_alloc(d, SH_type_oos_snapshot, 0);
-        }
     }
 #endif /* OOS */
 
@@ -2408,6 +2431,9 @@ static void sh_update_paging_modes(struc
             mfn_t mmfn = sh_make_monitor_table(
                              v, v->arch.paging.mode->shadow.shadow_levels);
 
+            if ( mfn_eq(mmfn, INVALID_MFN) )
+                return;
+
             v->arch.hvm.monitor_table = pagetable_from_mfn(mmfn);
             make_cr3(v, mmfn);
             hvm_update_host_cr3(v);
@@ -2446,6 +2472,12 @@ static void sh_update_paging_modes(struc
                 v->arch.hvm.monitor_table = pagetable_null();
                 new_mfn = sh_make_monitor_table(
                               v, v->arch.paging.mode->shadow.shadow_levels);
+                if ( mfn_eq(new_mfn, INVALID_MFN) )
+                {
+                    sh_destroy_monitor_table(v, old_mfn,
+                                             old_mode->shadow.shadow_levels);
+                    return;
+                }
                 v->arch.hvm.monitor_table = pagetable_from_mfn(new_mfn);
                 SHADOW_PRINTK("new monitor table %"PRI_mfn "\n",
                                mfn_x(new_mfn));
@@ -2531,7 +2563,12 @@ void sh_set_toplevel_shadow(struct vcpu
     if ( !mfn_valid(smfn) )
     {
         /* Make sure there's enough free shadow memory. */
-        shadow_prealloc(d, root_type, 1);
+        if ( !shadow_prealloc(d, root_type, 1) )
+        {
+            new_entry = pagetable_null();
+            goto install_new_entry;
+        }
+
         /* Shadow the page. */
         smfn = make_shadow(v, gmfn, root_type);
     }
--- a/xen/arch/x86/mm/shadow/hvm.c
+++ b/xen/arch/x86/mm/shadow/hvm.c
@@ -697,7 +697,9 @@ mfn_t sh_make_monitor_table(const struct
     ASSERT(!pagetable_get_pfn(v->arch.hvm.monitor_table));
 
     /* Guarantee we can get the memory we need */
-    shadow_prealloc(d, SH_type_monitor_table, CONFIG_PAGING_LEVELS);
+    if ( !shadow_prealloc(d, SH_type_monitor_table, CONFIG_PAGING_LEVELS) )
+        return INVALID_MFN;
+
     m4mfn = shadow_alloc(d, SH_type_monitor_table, 0);
     mfn_to_page(m4mfn)->shadow_flags = 4;
 
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -2447,9 +2447,14 @@ static int cf_check sh_page_fault(
      * Preallocate shadow pages *before* removing writable accesses
      * otherwhise an OOS L1 might be demoted and promoted again with
      * writable mappings. */
-    shadow_prealloc(d,
-                    SH_type_l1_shadow,
-                    GUEST_PAGING_LEVELS < 4 ? 1 : GUEST_PAGING_LEVELS - 1);
+    if ( !shadow_prealloc(d, SH_type_l1_shadow,
+                          GUEST_PAGING_LEVELS < 4
+                          ? 1 : GUEST_PAGING_LEVELS - 1) )
+    {
+        paging_unlock(d);
+        put_gfn(d, gfn_x(gfn));
+        return 0;
+    }
 
     rc = gw_remove_write_accesses(v, va, &gw);
 
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -383,7 +383,8 @@ void shadow_promote(struct domain *d, mf
 void shadow_demote(struct domain *d, mfn_t gmfn, u32 type);
 
 /* Shadow page allocation functions */
-void  shadow_prealloc(struct domain *d, u32 shadow_type, unsigned int count);
+bool __must_check shadow_prealloc(struct domain *d, unsigned int shadow_type,
+                                  unsigned int count);
 mfn_t shadow_alloc(struct domain *d,
                     u32 shadow_type,
                     unsigned long backpointer);
From: Roger Pau Monné <roger.pau@citrix.com>
Subject: x86/p2m: refuse new allocations for dying domains

This will in particular prevent any attempts to add entries to the p2m,
once - in a subsequent change - non-root entries have been removed.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
---
Changes since v9:
 - Split off from later patch.

--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -245,6 +245,9 @@ static struct page_info *hap_alloc(struc
 
     ASSERT(paging_locked_by_me(d));
 
+    if ( unlikely(d->is_dying) )
+        return NULL;
+
     pg = page_list_remove_head(&d->arch.paging.hap.freelist);
     if ( unlikely(!pg) )
         return NULL;
@@ -281,7 +284,7 @@ static struct page_info *cf_check hap_al
         d->arch.paging.hap.p2m_pages++;
         ASSERT(!page_get_owner(pg) && !(pg->count_info & PGC_count_mask));
     }
-    else if ( !d->arch.paging.p2m_alloc_failed )
+    else if ( !d->arch.paging.p2m_alloc_failed && !d->is_dying )
     {
         d->arch.paging.p2m_alloc_failed = 1;
         dprintk(XENLOG_ERR, "d%i failed to allocate from HAP pool\n",
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -939,6 +939,10 @@ static bool __must_check _shadow_preallo
     if ( d->arch.paging.shadow.free_pages >= pages )
         return true;
 
+    if ( unlikely(d->is_dying) )
+        /* No reclaim when the domain is dying, teardown will take care of it. */
+        return false;
+
     /* Shouldn't have enabled shadows if we've no vcpus. */
     ASSERT(d->vcpu && d->vcpu[0]);
 
@@ -991,7 +995,7 @@ static bool __must_check _shadow_preallo
            d->arch.paging.shadow.free_pages,
            d->arch.paging.shadow.p2m_pages);
 
-    ASSERT(d->is_dying);
+    ASSERT_UNREACHABLE();
 
     guest_flush_tlb_mask(d, d->dirty_cpumask);
 
@@ -1005,10 +1009,13 @@ static bool __must_check _shadow_preallo
  * to avoid freeing shadows that the caller is currently working on. */
 bool shadow_prealloc(struct domain *d, unsigned int type, unsigned int count)
 {
-    bool ret = _shadow_prealloc(d, shadow_size(type) * count);
+    bool ret;
+
+    if ( unlikely(d->is_dying) )
+       return false;
 
-    if ( !ret && !d->is_dying &&
-         (!d->is_shutting_down || d->shutdown_code != SHUTDOWN_crash) )
+    ret = _shadow_prealloc(d, shadow_size(type) * count);
+    if ( !ret && (!d->is_shutting_down || d->shutdown_code != SHUTDOWN_crash) )
         /*
          * Failing to allocate memory required for shadow usage can only result in
          * a domain crash, do it here rather than relying on every caller to do it.
@@ -1238,6 +1245,9 @@ shadow_alloc_p2m_page(struct domain *d)
 {
     struct page_info *pg = NULL;
 
+    if ( unlikely(d->is_dying) )
+       return NULL;
+
     /* This is called both from the p2m code (which never holds the
      * paging lock) and the log-dirty code (which always does). */
     paging_lock_recursive(d);
From: Roger Pau Monné <roger.pau@citrix.com>
Subject: x86/p2m: truly free paging pool memory for dying domains

Modify {hap,shadow}_free to free the page immediately if the domain is
dying, so that pages don't accumulate in the pool when
{shadow,hap}_final_teardown() get called. This is to limit the amount of
work which needs to be done there (in a non-preemptable manner).

Note the call to shadow_free() in shadow_free_p2m_page() is moved after
increasing total_pages, so that the decrease done in shadow_free() in
case the domain is dying doesn't underflow the counter, even if just for
a short interval.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
---
Changes since v9:
 - Split off from later patch. Add comments.

--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -265,6 +265,18 @@ static void hap_free(struct domain *d, m
 
     ASSERT(paging_locked_by_me(d));
 
+    /*
+     * For dying domains, actually free the memory here. This way less work is
+     * left to hap_final_teardown(), which cannot easily have preemption checks
+     * added.
+     */
+    if ( unlikely(d->is_dying) )
+    {
+        free_domheap_page(pg);
+        d->arch.paging.hap.total_pages--;
+        return;
+    }
+
     d->arch.paging.hap.free_pages++;
     page_list_add_tail(pg, &d->arch.paging.hap.freelist);
 }
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1187,6 +1187,7 @@ mfn_t shadow_alloc(struct domain *d,
 void shadow_free(struct domain *d, mfn_t smfn)
 {
     struct page_info *next = NULL, *sp = mfn_to_page(smfn);
+    bool dying = ACCESS_ONCE(d->is_dying);
     struct page_list_head *pin_list;
     unsigned int pages;
     u32 shadow_type;
@@ -1229,11 +1230,32 @@ void shadow_free(struct domain *d, mfn_t
          * just before the allocator hands the page out again. */
         page_set_tlbflush_timestamp(sp);
         perfc_decr(shadow_alloc_count);
-        page_list_add_tail(sp, &d->arch.paging.shadow.freelist);
+
+        /*
+         * For dying domains, actually free the memory here. This way less
+         * work is left to shadow_final_teardown(), which cannot easily have
+         * preemption checks added.
+         */
+        if ( unlikely(dying) )
+        {
+            /*
+             * The backpointer field (sh.back) used by shadow code aliases the
+             * domain owner field, unconditionally clear it here to avoid
+             * free_domheap_page() attempting to parse it.
+             */
+            page_set_owner(sp, NULL);
+            free_domheap_page(sp);
+        }
+        else
+            page_list_add_tail(sp, &d->arch.paging.shadow.freelist);
+
         sp = next;
     }
 
-    d->arch.paging.shadow.free_pages += pages;
+    if ( unlikely(dying) )
+        d->arch.paging.shadow.total_pages -= pages;
+    else
+        d->arch.paging.shadow.free_pages += pages;
 }
 
 /* Divert a page from the pool to be used by the p2m mapping.
@@ -1303,9 +1325,9 @@ shadow_free_p2m_page(struct domain *d, s
      * paging lock) and the log-dirty code (which always does). */
     paging_lock_recursive(d);
 
-    shadow_free(d, page_to_mfn(pg));
     d->arch.paging.shadow.p2m_pages--;
     d->arch.paging.shadow.total_pages++;
+    shadow_free(d, page_to_mfn(pg));
 
     paging_unlock(d);
 }
From: Roger Pau Monné <roger.pau@citrix.com>
Subject: x86/p2m: free the paging memory pool preemptively

The paging memory pool is currently freed in two different places:
from {shadow,hap}_teardown() via domain_relinquish_resources() and
from {shadow,hap}_final_teardown() via complete_domain_destroy().
While the former does handle preemption, the latter doesn't.

Attempt to move as much p2m related freeing as possible to happen
before the call to {shadow,hap}_teardown(), so that most memory can be
freed in a preemptive way.  In order to avoid causing issues to
existing callers, leave the root p2m page tables set and free them in
{hap,shadow}_final_teardown().  Also modify {hap,shadow}_free to free
the page immediately if the domain is dying, so that pages don't
accumulate in the pool when {shadow,hap}_final_teardown() get called.

Move altp2m_vcpu_disable_ve() to be done in hap_teardown(), as that's
the place where altp2m_active gets disabled now.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
---
Changes since v9:
- Split parts off into new earlier patches.

Changes since v7:
 - Re-base over changes earlier in the series.
 - Convert ASSERT() to ASSERT_UNREACHABLE() in _shadow_prealloc().
 - Avoid double blank lines.

Changes since v6:
 - Fail allocation and return pages to domheap on free only when
   d->is_dying, otherwise still running vCPUs could malfunction.

Changes since v4:
 - Use the newly introduced is_dying_domain().
 - Clear the page owner field in shadow_free() before returning to
   domheap.
 - Use preemption in the shadow_blow_tables() call in
   shadow_teardown().

Changes since v3:
 - Use FREE_XENHEAP_PAGE.
 - Clear v.sh.back instead of the domain owner in shadow_free().
 - Fail pre-allocation in _shadow_prealloc() when dying.

Changes since v2:
 - Free memory when returned to the hap/shadow pool if the domain is
   dying.
 - Restore shadow_teardown() call in shadow_final_teardown().
 - Expand commit message.
 - Do not print an error message if paging allocation fails when the
   domain is dying.

Changes since v1:
 - Fix calls to altp2m_vcpu_disable_ve().
 - Fix some style issues of moved code.
 - Fix {hap,shadow}_final_teardown() to empty the pool, in case it
   gets called from the failed domain creation path.
 - Leave the root page table(s) in place until
   {shadow,hap}_final_teardown().
 - Prevent p2m memory allocation if the domain is dying.

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -38,7 +38,6 @@
 #include <xen/livepatch.h>
 #include <public/sysctl.h>
 #include <public/hvm/hvm_vcpu.h>
-#include <asm/altp2m.h>
 #include <asm/regs.h>
 #include <asm/mc146818rtc.h>
 #include <asm/system.h>
@@ -2406,12 +2405,6 @@ int domain_relinquish_resources(struct d
             vpmu_destroy(v);
         }
 
-        if ( altp2m_active(d) )
-        {
-            for_each_vcpu ( d, v )
-                altp2m_vcpu_disable_ve(v);
-        }
-
         if ( is_pv_domain(d) )
         {
             for_each_vcpu ( d, v )
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -28,6 +28,7 @@
 #include <xen/domain_page.h>
 #include <xen/guest_access.h>
 #include <xen/keyhandler.h>
+#include <asm/altp2m.h>
 #include <asm/event.h>
 #include <asm/page.h>
 #include <asm/current.h>
@@ -546,24 +547,8 @@ void hap_final_teardown(struct domain *d
     unsigned int i;
 
     if ( hvm_altp2m_supported() )
-    {
-        d->arch.altp2m_active = 0;
-
-        if ( d->arch.altp2m_eptp )
-        {
-            free_xenheap_page(d->arch.altp2m_eptp);
-            d->arch.altp2m_eptp = NULL;
-        }
-
-        if ( d->arch.altp2m_visible_eptp )
-        {
-            free_xenheap_page(d->arch.altp2m_visible_eptp);
-            d->arch.altp2m_visible_eptp = NULL;
-        }
-
         for ( i = 0; i < MAX_ALTP2M; i++ )
             p2m_teardown(d->arch.altp2m_p2m[i], true);
-    }
 
     /* Destroy nestedp2m's first */
     for (i = 0; i < MAX_NESTEDP2M; i++) {
@@ -578,6 +563,8 @@ void hap_final_teardown(struct domain *d
     paging_lock(d);
     hap_set_allocation(d, 0, NULL);
     ASSERT(d->arch.paging.hap.p2m_pages == 0);
+    ASSERT(d->arch.paging.hap.free_pages == 0);
+    ASSERT(d->arch.paging.hap.total_pages == 0);
     paging_unlock(d);
 }
 
@@ -603,6 +590,7 @@ void hap_vcpu_teardown(struct vcpu *v)
 void hap_teardown(struct domain *d, bool *preempted)
 {
     struct vcpu *v;
+    unsigned int i;
 
     ASSERT(d->is_dying);
     ASSERT(d != current->domain);
@@ -611,6 +599,28 @@ void hap_teardown(struct domain *d, bool
     for_each_vcpu ( d, v )
         hap_vcpu_teardown(v);
 
+    /* Leave the root pt in case we get further attempts to modify the p2m. */
+    if ( hvm_altp2m_supported() )
+    {
+        if ( altp2m_active(d) )
+            for_each_vcpu ( d, v )
+                altp2m_vcpu_disable_ve(v);
+
+        d->arch.altp2m_active = 0;
+
+        FREE_XENHEAP_PAGE(d->arch.altp2m_eptp);
+        FREE_XENHEAP_PAGE(d->arch.altp2m_visible_eptp);
+
+        for ( i = 0; i < MAX_ALTP2M; i++ )
+            p2m_teardown(d->arch.altp2m_p2m[i], false);
+    }
+
+    /* Destroy nestedp2m's after altp2m. */
+    for ( i = 0; i < MAX_NESTEDP2M; i++ )
+        p2m_teardown(d->arch.nested_p2m[i], false);
+
+    p2m_teardown(p2m_get_hostp2m(d), false);
+
     paging_lock(d); /* Keep various asserts happy */
 
     if ( d->arch.paging.hap.total_pages != 0 )
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -2831,8 +2831,17 @@ void shadow_teardown(struct domain *d, b
     for_each_vcpu ( d, v )
         shadow_vcpu_teardown(v);
 
+    p2m_teardown(p2m_get_hostp2m(d), false);
+
     paging_lock(d);
 
+    /*
+     * Reclaim all shadow memory so that shadow_set_allocation() doesn't find
+     * in-use pages, as _shadow_prealloc() will no longer try to reclaim pages
+     * because the domain is dying.
+     */
+    shadow_blow_tables(d);
+
 #if (SHADOW_OPTIMIZATIONS & (SHOPT_VIRTUAL_TLB|SHOPT_OUT_OF_SYNC))
     /* Free the virtual-TLB array attached to each vcpu */
     for_each_vcpu(d, v)
@@ -2953,6 +2962,9 @@ void shadow_final_teardown(struct domain
                    d->arch.paging.shadow.total_pages,
                    d->arch.paging.shadow.free_pages,
                    d->arch.paging.shadow.p2m_pages);
+    ASSERT(!d->arch.paging.shadow.total_pages);
+    ASSERT(!d->arch.paging.shadow.free_pages);
+    ASSERT(!d->arch.paging.shadow.p2m_pages);
     paging_unlock(d);
 }
 
From: Julien Grall <jgrall@amazon.com>
Subject: xen/x86: p2m: Add preemption in p2m_teardown()

The list p2m->pages contains all the pages used by the P2M. On large
instances this list can be quite long, and freeing it by calling
d->arch.paging.free_page() already takes more than 1ms for an 80GB
guest on Xen running in a nested environment on a c5.metal.

By extrapolation (8TB is 100 times 80GB), it would take more than
100ms for an 8TB guest, which is what we currently security support.
So add some preemption in p2m_teardown() and propagate it to the
callers. Note there are three places where preemption is not enabled:
    - hap_final_teardown()/shadow_final_teardown(): updates to the P2M
      are prevented once the domain is dying (so no more pages can be
      allocated), and most of the P2M pages will already have been
      freed in a preemptible manner while relinquishing the resources.
      So it is fine to leave preemption disabled here.
    - shadow_enable(): This is fine because it only undoes the
      allocation that may have been made by p2m_alloc_table() (i.e.
      just the root page table).

The preemption is arbitrarily checked every 1024 iterations.
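
For illustration only (not part of the patch below): a self-contained
sketch of that chunked-work pattern, with hypothetical helpers
(list_pop(), free_one(), preempt_needed()) standing in for Xen's
page_list_remove_head(), d->arch.paging.free_page() and
general_preempt_check().

#include <stdbool.h>
#include <stddef.h>

struct page {
    struct page *next;
};

/* Stand-ins for Xen's list, free and preemption primitives. */
static struct page *list_pop(struct page **head)
{
    struct page *pg = *head;

    if ( pg )
        *head = pg->next;
    return pg;
}

static void free_one(struct page *pg)
{
    (void)pg;
}

static bool preempt_needed(void)
{
    return false;  /* e.g. a pending softirq or event */
}

/* Returns true if interrupted; the caller is expected to reschedule
 * and call again later to continue where it left off. */
static bool teardown_pages(struct page **pages)
{
    unsigned int i = 0;
    struct page *pg;

    while ( (pg = list_pop(pages)) != NULL )
    {
        free_one(pg);

        /*
         * Only check every 1024 iterations: checking on every page
         * would waste cycles, checking too rarely defeats the purpose.
         */
        if ( !(++i % 1024) && preempt_needed() )
            return true;
    }

    return false;
}

Because the work simply resumes by popping from the same list on the
next invocation, no separate cursor needs to be saved across
preemptions.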

We now need to include <xen/event.h> in p2m-basic in order to get the
definition of local_events_need_delivery(), used by
general_preempt_check(). Ideally the inclusion would happen in
xen/sched.h, but that opened a can of worms.

Note that with the current approach Xen doesn't keep track of whether
the alt/nested P2Ms have already been cleared, so some work is done
redundantly. However, this is not expected to incur much overhead (the
P2M lock shouldn't be contended during teardown), so that optimization
is left outside of the scope of this security fix.

This is part of CVE-2022-33746 / XSA-410.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Changes since v12:
    - Correct altp2m preemption check placement.

Changes since v9:
    - Integrate patch into series.

Changes since v2:
    - Rework the loop doing the preemption
    - Add a comment in shadow_enable() to explain why p2m_teardown()
      doesn't need to be preemptible.

Changes since v1:
    - Update the commit message
    - Rebase on top of Roger's v8 series
    - Fix preemption check
    - Use 'unsigned int' rather than 'unsigned long' for the counter

--- a/xen/arch/x86/include/asm/p2m.h
+++ b/xen/arch/x86/include/asm/p2m.h
@@ -600,7 +600,7 @@ int p2m_init(struct domain *d);
 int p2m_alloc_table(struct p2m_domain *p2m);
 
 /* Return all the p2m resources to Xen. */
-void p2m_teardown(struct p2m_domain *p2m, bool remove_root);
+void p2m_teardown(struct p2m_domain *p2m, bool remove_root, bool *preempted);
 void p2m_final_teardown(struct domain *d);
 
 /* Add/remove a page to/from a domain's p2m table. */
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -548,17 +548,17 @@ void hap_final_teardown(struct domain *d
 
     if ( hvm_altp2m_supported() )
         for ( i = 0; i < MAX_ALTP2M; i++ )
-            p2m_teardown(d->arch.altp2m_p2m[i], true);
+            p2m_teardown(d->arch.altp2m_p2m[i], true, NULL);
 
     /* Destroy nestedp2m's first */
     for (i = 0; i < MAX_NESTEDP2M; i++) {
-        p2m_teardown(d->arch.nested_p2m[i], true);
+        p2m_teardown(d->arch.nested_p2m[i], true, NULL);
     }
 
     if ( d->arch.paging.hap.total_pages != 0 )
         hap_teardown(d, NULL);
 
-    p2m_teardown(p2m_get_hostp2m(d), true);
+    p2m_teardown(p2m_get_hostp2m(d), true, NULL);
     /* Free any memory that the p2m teardown released */
     paging_lock(d);
     hap_set_allocation(d, 0, NULL);
@@ -612,14 +612,24 @@ void hap_teardown(struct domain *d, bool
         FREE_XENHEAP_PAGE(d->arch.altp2m_visible_eptp);
 
         for ( i = 0; i < MAX_ALTP2M; i++ )
-            p2m_teardown(d->arch.altp2m_p2m[i], false);
+        {
+            p2m_teardown(d->arch.altp2m_p2m[i], false, preempted);
+            if ( preempted && *preempted )
+                return;
+        }
     }
 
     /* Destroy nestedp2m's after altp2m. */
     for ( i = 0; i < MAX_NESTEDP2M; i++ )
-        p2m_teardown(d->arch.nested_p2m[i], false);
+    {
+        p2m_teardown(d->arch.nested_p2m[i], false, preempted);
+        if ( preempted && *preempted )
+            return;
+    }
 
-    p2m_teardown(p2m_get_hostp2m(d), false);
+    p2m_teardown(p2m_get_hostp2m(d), false, preempted);
+    if ( preempted && *preempted )
+        return;
 
     paging_lock(d); /* Keep various asserts happy */
 
--- a/xen/arch/x86/mm/p2m-basic.c
+++ b/xen/arch/x86/mm/p2m-basic.c
@@ -23,6 +23,7 @@
  * along with this program; If not, see <http://www.gnu.org/licenses/>.
  */
 
+#include <xen/event.h>
 #include <xen/types.h>
 #include <asm/p2m.h>
 #include "mm-locks.h"
@@ -154,11 +155,12 @@ int p2m_init(struct domain *d)
  * hvm fixme: when adding support for pvh non-hardware domains, this path must
  * cleanup any foreign p2m types (release refcnts on them).
  */
-void p2m_teardown(struct p2m_domain *p2m, bool remove_root)
+void p2m_teardown(struct p2m_domain *p2m, bool remove_root, bool *preempted)
 {
 #ifdef CONFIG_HVM
     struct page_info *pg, *root_pg = NULL;
     struct domain *d;
+    unsigned int i = 0;
 
     if ( !p2m )
         return;
@@ -180,8 +182,19 @@ void p2m_teardown(struct p2m_domain *p2m
     }
 
     while ( (pg = page_list_remove_head(&p2m->pages)) )
-        if ( pg != root_pg )
-            d->arch.paging.free_page(d, pg);
+    {
+        if ( pg == root_pg )
+            continue;
+
+        d->arch.paging.free_page(d, pg);
+
+        /* Arbitrarily check preemption every 1024 iterations */
+        if ( preempted && !(++i % 1024) && general_preempt_check() )
+        {
+            *preempted = true;
+            break;
+        }
+    }
 
     if ( root_pg )
         page_list_add(root_pg, &p2m->pages);
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -2776,8 +2776,12 @@ int shadow_enable(struct domain *d, u32
     paging_unlock(d);
  out_unlocked:
 #ifdef CONFIG_HVM
+    /*
+     * This is fine to ignore the preemption here because only the root
+     * will be allocated by p2m_alloc_table().
+     */
     if ( rv != 0 && !pagetable_is_null(p2m_get_pagetable(p2m)) )
-        p2m_teardown(p2m, true);
+        p2m_teardown(p2m, true, NULL);
 #endif
     if ( rv != 0 && pg != NULL )
     {
@@ -2831,7 +2835,9 @@ void shadow_teardown(struct domain *d, b
     for_each_vcpu ( d, v )
         shadow_vcpu_teardown(v);
 
-    p2m_teardown(p2m_get_hostp2m(d), false);
+    p2m_teardown(p2m_get_hostp2m(d), false, preempted);
+    if ( preempted && *preempted )
+        return;
 
     paging_lock(d);
 
@@ -2952,7 +2958,7 @@ void shadow_final_teardown(struct domain
         shadow_teardown(d, NULL);
 
     /* It is now safe to pull down the p2m map. */
-    p2m_teardown(p2m_get_hostp2m(d), true);
+    p2m_teardown(p2m_get_hostp2m(d), true, NULL);
     /* Free any shadow memory that the p2m teardown released */
     paging_lock(d);
     shadow_set_allocation(d, 0, NULL);