Xen Security Advisory 438 v2 (CVE-2023-34322) - top-level shadow reference dropped too early for 64-bit PV guests

Xen.org security team posted 1 patch 7 months, 1 week ago
Failed in applying to current master (apply log)
Xen Security Advisory 438 v2 (CVE-2023-34322) - top-level shadow reference dropped too early for 64-bit PV guests
Posted by Xen.org security team 7 months, 1 week ago
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

            Xen Security Advisory CVE-2023-34322 / XSA-438
                               version 2

   top-level shadow reference dropped too early for 64-bit PV guests

UPDATES IN VERSION 2
====================

Public release.

ISSUE DESCRIPTION
=================

For migration as well as to work around kernels unaware of L1TF (see
XSA-273), PV guests may be run in shadow paging mode.  Since Xen itself
needs to be mapped when PV guests run, Xen and shadowed PV guests run
directly the respective shadow page tables.  For 64-bit PV guests this
means running on the shadow of the guest root page table.

In the course of dealing with shortage of memory in the shadow pool
associated with a domain, shadows of page tables may be torn down.  This
tearing down may include the shadow root page table that the CPU in
question is presently running on.  While a precaution exists to
supposedly prevent the tearing down of the underlying live page table,
the time window covered by that precaution isn't large enough.

IMPACT
======

Privilege escalation, Denial of Service (DoS) affecting the entire host,
and information leaks all cannot be ruled out.

VULNERABLE SYSTEMS
==================

All Xen versions from at least 3.2 onwards are vulnerable.  Earlier
versions have not been inspected.

Only x86 systems are vulnerable.  Only 64-bit PV guests can leverage the
vulnerability, and only when running in shadow mode.  Shadow mode would
be in use when migrating guests or as a workaround for XSA-273 (L1TF).

MITIGATION
==========

Running only HVM or PVH guests will avoid the vulnerability.

Running PV guests in the PV shim will also avoid the vulnerability.

CREDITS
=======

This issue was discovered by Tim Deegan, and Jan Beulich of SUSE.

RESOLUTION
==========

Applying the appropriate attached patch resolves this issue.

Note that patches for released versions are generally prepared to
apply to the stable branches, and may not apply cleanly to the most
recent release tarball.  Downstreams are encouraged to update to the
tip of the stable branch before applying these patches.

xsa438.patch           xen-unstable
xsa438-4.17.patch      Xen 4.17.x
xsa438-4.16.patch      Xen 4.16.x
xsa438-4.15.patch      Xen 4.15.x

$ sha256sum xsa438*
f30067fa3732fb52042b14a2836b610c29af47461425f1a1ccec21cb8a5a48b1  xsa438.patch
a2e7d7c12ea19fb95e2d825fda5f7d0124cbb5c4a369cb58ab6036d266b7e297  xsa438-4.15.patch
eb75fbeb4aa635d6104c12acd5f7311e477f7c159f2ec4eca8a345327a9aee24  xsa438-4.16.patch
f3a305c86124e48b9afa14f3ba76b81d1f5d8d472e2412ae3d014305c749a86a  xsa438-4.17.patch
$

DEPLOYMENT DURING EMBARGO
=========================

Deployment of the patches and/or mitigations described above (or
others which are substantially similar) is permitted during the
embargo, even on public-facing systems with untrusted guest users and
administrators.

But: Distribution of updated software is prohibited (except to other
members of the predisclosure list).

Predisclosure list members who wish to deploy significantly different
patches and/or mitigations, please contact the Xen Project Security
Team.

(Note: this during-embargo deployment notice is retained in
post-embargo publicly released Xen Project advisories, even though it
is then no longer applicable.  This is to enable the community to have
oversight of the Xen Project Security Team's decisionmaking.)

For more information about permissible uses of embargoed information,
consult the Xen Project community's agreed Security Policy:
  http://www.xenproject.org/security-policy.html
-----BEGIN PGP SIGNATURE-----

iQFABAEBCAAqFiEEI+MiLBRfRHX6gGCng/4UyVfoK9kFAmUKuSAMHHBncEB4ZW4u
b3JnAAoJEIP+FMlX6CvZtL0IAL3mXsj7Q5Xfu/Tof0a1ie7TnpvZ2qXxzoLlyiFR
Vra9gs83Nw7n45yXFFVLSzTjmz2bCbCmUowPp6TxF9Nawt0JocbF80JpYKEojEko
6B2BAdUFhPXtx1D6NruzG2gVr5qn/eNJjIIos0o7tzxtBPLKX9qzLh3FmZK5BJm2
HyKMLIEZuVipb3Qtb+avUDHvLjee6p4eaaWOk08g3sSWhtSfwxlS4IF9j1G2Oejj
QKZ1XILCP8miXmuUZJ/L/7CzFvOm+DKNVFZYhFT0fjDWk3vNhtLcBv5s36Z65gKK
MvKe7owffmclQLWjOekYNm8dG5gQ/OkWRAPbxiwRMegT22g=
=L3du
-----END PGP SIGNATURE-----
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/shadow: defer releasing of PV's top-level shadow reference

sh_set_toplevel_shadow() re-pinning the top-level shadow we may be
running on is not enough (and at the same time unnecessary when the
shadow isn't what we're running on): That shadow becomes eligible for
blowing away (from e.g. shadow_prealloc()) immediately after the
paging lock was dropped. Yet it needs to remain valid until the actual
page table switch occurred.

Propagate up the call chain the shadow entry that needs releasing
eventually, and carry out the release immediately after switching page
tables. Handle update_cr3() failures by switching to idle pagetables.
Note that various further uses of update_cr3() are HVM-only or only act
on paused vCPU-s, in which case sh_set_toplevel_shadow() will not defer
releasing of the reference.

While changing the update_cr3() hook, also convert the "do_locking"
parameter to boolean.

This is CVE-2023-34322 / XSA-438.

Reported-by: Tim Deegan <tim@xen.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@cloud.com>
---
The bubbling up of the shadow entry could likely be avoided (by
inspecting v->arch.cr3 and the corresponding frame_table[] entry), but
going that route seemed quite a bit more fragile to me.

In _toggle_guest_pt() I was heavily inclined to also move up cr3's
initializer: This wouldn't grow the diff, but since it would still be
slightly more code churn, I then didn't do so.
---
v4: Convert a condition operator use to if(). Extend comments.
v3: In _toggle_guest_pt() don't mistakenly pass the idle domain to
    shadow_put_top_level(). Add locking to shadow_put_top_level().
    Always consume sh_set_toplevel_shadow()'s return value. Add
    assertions. Massage description.
v2: Fix !SHADOW_PAGING build.

--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -556,7 +556,7 @@ void audit_domains(void);
 #endif
 
 void make_cr3(struct vcpu *v, mfn_t mfn);
-void update_cr3(struct vcpu *v);
+pagetable_t update_cr3(struct vcpu *v);
 int vcpu_destroy_pagetables(struct vcpu *);
 void *do_page_walk(struct vcpu *v, unsigned long addr);
 
--- a/xen/arch/x86/include/asm/paging.h
+++ b/xen/arch/x86/include/asm/paging.h
@@ -118,7 +118,7 @@ struct paging_mode {
                                             paddr_t ga, uint32_t *pfec,
                                             unsigned int *page_order);
 #endif
-    void          (*update_cr3            )(struct vcpu *v, int do_locking,
+    pagetable_t   (*update_cr3            )(struct vcpu *v, bool do_locking,
                                             bool noflush);
 
     unsigned int guest_levels;
@@ -294,9 +294,9 @@ static inline bool paging_flush_tlb(cons
 /* Update all the things that are derived from the guest's CR3.
  * Called when the guest changes CR3; the caller can then use v->arch.cr3
  * as the value to load into the host CR3 to schedule this vcpu */
-static inline void paging_update_cr3(struct vcpu *v, bool noflush)
+static inline pagetable_t paging_update_cr3(struct vcpu *v, bool noflush)
 {
-    paging_get_hostmode(v)->update_cr3(v, 1, noflush);
+    return paging_get_hostmode(v)->update_cr3(v, 1, noflush);
 }
 
 /* Update all the things that are derived from the guest's CR0/CR3/CR4.
--- a/xen/arch/x86/include/asm/shadow.h
+++ b/xen/arch/x86/include/asm/shadow.h
@@ -94,6 +94,9 @@ void shadow_blow_tables_per_domain(struc
 int shadow_set_allocation(struct domain *d, unsigned int pages,
                           bool *preempted);
 
+/* Helper to invoke for deferred releasing of a top-level shadow's reference. */
+void shadow_put_top_level(struct domain *d, pagetable_t old);
+
 #else /* !CONFIG_SHADOW_PAGING */
 
 #define shadow_vcpu_teardown(v) ASSERT(is_pv_vcpu(v))
@@ -114,6 +117,11 @@ static inline void shadow_prepare_page_t
 
 static inline void shadow_blow_tables_per_domain(struct domain *d) {}
 
+static inline void shadow_put_top_level(struct domain *d, pagetable_t old)
+{
+    ASSERT_UNREACHABLE();
+}
+
 static inline int shadow_domctl(struct domain *d,
                                 struct xen_domctl_shadow_op *sc,
                                 XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -707,11 +707,13 @@ static bool cf_check hap_invlpg(struct v
     return 1;
 }
 
-static void cf_check hap_update_cr3(
-    struct vcpu *v, int do_locking, bool noflush)
+static pagetable_t cf_check hap_update_cr3(
+    struct vcpu *v, bool do_locking, bool noflush)
 {
     v->arch.hvm.hw_cr[3] = v->arch.hvm.guest_cr[3];
     hvm_update_guest_cr3(v, noflush);
+
+    return pagetable_null();
 }
 
 static bool flush_vcpu(const struct vcpu *v, const unsigned long *vcpu_bitmap)
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -2526,13 +2526,13 @@ static void cf_check shadow_update_pagin
 }
 
 /* Set up the top-level shadow and install it in slot 'slot' of shadow_table */
-void sh_set_toplevel_shadow(struct vcpu *v,
-                            unsigned int slot,
-                            mfn_t gmfn,
-                            unsigned int root_type,
-                            mfn_t (*make_shadow)(struct vcpu *v,
-                                                 mfn_t gmfn,
-                                                 uint32_t shadow_type))
+pagetable_t sh_set_toplevel_shadow(struct vcpu *v,
+                                   unsigned int slot,
+                                   mfn_t gmfn,
+                                   unsigned int root_type,
+                                   mfn_t (*make_shadow)(struct vcpu *v,
+                                                        mfn_t gmfn,
+                                                        uint32_t shadow_type))
 {
     mfn_t smfn;
     pagetable_t old_entry, new_entry;
@@ -2589,20 +2589,37 @@ void sh_set_toplevel_shadow(struct vcpu
                   mfn_x(gmfn), mfn_x(pagetable_get_mfn(new_entry)));
     v->arch.paging.shadow.shadow_table[slot] = new_entry;
 
-    /* Decrement the refcount of the old contents of this slot */
-    if ( !pagetable_is_null(old_entry) )
+    /*
+     * Decrement the refcount of the old contents of this slot, unless
+     * we're still running on that shadow - in that case it'll need holding
+     * on to until the actual page table switch did occur.
+     */
+    if ( !pagetable_is_null(old_entry) && (v != current || !is_pv_domain(d)) )
     {
-        mfn_t old_smfn = pagetable_get_mfn(old_entry);
-        /* Need to repin the old toplevel shadow if it's been unpinned
-         * by shadow_prealloc(): in PV mode we're still running on this
-         * shadow and it's not safe to free it yet. */
-        if ( !mfn_to_page(old_smfn)->u.sh.pinned && !sh_pin(d, old_smfn) )
-        {
-            printk(XENLOG_G_ERR "can't re-pin %"PRI_mfn"\n", mfn_x(old_smfn));
-            domain_crash(d);
-        }
-        sh_put_ref(d, old_smfn, 0);
+        sh_put_ref(d, pagetable_get_mfn(old_entry), 0);
+        old_entry = pagetable_null();
     }
+
+    /*
+     * 2- and 3-level shadow mode is used for HVM only. Therefore we never run
+     * on such a shadow, so only call sites requesting an L4 shadow need to pay
+     * attention to the returned value.
+     */
+    ASSERT(pagetable_is_null(old_entry) || root_type == SH_type_l4_64_shadow);
+
+    return old_entry;
+}
+
+/*
+ * Helper invoked when releasing of a top-level shadow's reference was
+ * deferred in sh_set_toplevel_shadow() above.
+ */
+void shadow_put_top_level(struct domain *d, pagetable_t old_entry)
+{
+    ASSERT(!pagetable_is_null(old_entry));
+    paging_lock(d);
+    sh_put_ref(d, pagetable_get_mfn(old_entry), 0);
+    paging_unlock(d);
 }
 
 /**************************************************************************/
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -3156,7 +3156,8 @@ sh_update_linear_entries(struct vcpu *v)
     sh_flush_local(d);
 }
 
-static void cf_check sh_update_cr3(struct vcpu *v, int do_locking, bool noflush)
+static pagetable_t cf_check sh_update_cr3(struct vcpu *v, bool do_locking,
+                                          bool noflush)
 /* Updates vcpu->arch.cr3 after the guest has changed CR3.
  * Paravirtual guests should set v->arch.guest_table (and guest_table_user,
  * if appropriate).
@@ -3170,6 +3171,7 @@ static void cf_check sh_update_cr3(struc
 {
     struct domain *d = v->domain;
     mfn_t gmfn;
+    pagetable_t old_entry = pagetable_null();
 #if GUEST_PAGING_LEVELS == 3
     const guest_l3e_t *gl3e;
     unsigned int i, guest_idx;
@@ -3179,7 +3181,7 @@ static void cf_check sh_update_cr3(struc
     if ( !is_hvm_domain(d) && !v->is_initialised )
     {
         ASSERT(v->arch.cr3 == 0);
-        return;
+        return old_entry;
     }
 
     if ( do_locking ) paging_lock(v->domain);
@@ -3252,11 +3254,12 @@ static void cf_check sh_update_cr3(struc
 #if GUEST_PAGING_LEVELS == 4
     if ( sh_remove_write_access(d, gmfn, 4, 0) != 0 )
         guest_flush_tlb_mask(d, d->dirty_cpumask);
-    sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l4_shadow, sh_make_shadow);
+    old_entry = sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l4_shadow,
+                                       sh_make_shadow);
     if ( unlikely(pagetable_is_null(v->arch.paging.shadow.shadow_table[0])) )
     {
         ASSERT(d->is_dying || d->is_shutting_down);
-        return;
+        return old_entry;
     }
     if ( !shadow_mode_external(d) && !is_pv_32bit_domain(d) )
     {
@@ -3300,24 +3303,30 @@ static void cf_check sh_update_cr3(struc
                 gl2gfn = guest_l3e_get_gfn(gl3e[i]);
                 gl2mfn = get_gfn_query_unlocked(d, gfn_x(gl2gfn), &p2mt);
                 if ( p2m_is_ram(p2mt) )
-                    sh_set_toplevel_shadow(v, i, gl2mfn, SH_type_l2_shadow,
-                                           sh_make_shadow);
+                    old_entry = sh_set_toplevel_shadow(v, i, gl2mfn,
+                                                       SH_type_l2_shadow,
+                                                       sh_make_shadow);
                 else
-                    sh_set_toplevel_shadow(v, i, INVALID_MFN, 0,
-                                           sh_make_shadow);
+                    old_entry = sh_set_toplevel_shadow(v, i, INVALID_MFN, 0,
+                                                       sh_make_shadow);
             }
             else
-                sh_set_toplevel_shadow(v, i, INVALID_MFN, 0, sh_make_shadow);
+                old_entry = sh_set_toplevel_shadow(v, i, INVALID_MFN, 0,
+                                                   sh_make_shadow);
+
+            ASSERT(pagetable_is_null(old_entry));
         }
     }
 #elif GUEST_PAGING_LEVELS == 2
     if ( sh_remove_write_access(d, gmfn, 2, 0) != 0 )
         guest_flush_tlb_mask(d, d->dirty_cpumask);
-    sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l2_shadow, sh_make_shadow);
+    old_entry = sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l2_shadow,
+                                       sh_make_shadow);
+    ASSERT(pagetable_is_null(old_entry));
     if ( unlikely(pagetable_is_null(v->arch.paging.shadow.shadow_table[0])) )
     {
         ASSERT(d->is_dying || d->is_shutting_down);
-        return;
+        return old_entry;
     }
 #else
 #error This should never happen
@@ -3405,6 +3414,8 @@ static void cf_check sh_update_cr3(struc
 
     /* Release the lock, if we took it (otherwise it's the caller's problem) */
     if ( do_locking ) paging_unlock(v->domain);
+
+    return old_entry;
 }
 
 
--- a/xen/arch/x86/mm/shadow/none.c
+++ b/xen/arch/x86/mm/shadow/none.c
@@ -55,9 +55,11 @@ static unsigned long cf_check _gva_to_gf
 }
 #endif
 
-static void cf_check _update_cr3(struct vcpu *v, int do_locking, bool noflush)
+static pagetable_t cf_check _update_cr3(struct vcpu *v, bool do_locking,
+                                        bool noflush)
 {
     ASSERT_UNREACHABLE();
+    return pagetable_null();
 }
 
 static const struct paging_mode sh_paging_none = {
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -383,13 +383,13 @@ mfn_t shadow_alloc(struct domain *d,
 void  shadow_free(struct domain *d, mfn_t smfn);
 
 /* Set up the top-level shadow and install it in slot 'slot' of shadow_table */
-void sh_set_toplevel_shadow(struct vcpu *v,
-                            unsigned int slot,
-                            mfn_t gmfn,
-                            unsigned int root_type,
-                            mfn_t (*make_shadow)(struct vcpu *v,
-                                                 mfn_t gmfn,
-                                                 uint32_t shadow_type));
+pagetable_t sh_set_toplevel_shadow(struct vcpu *v,
+                                   unsigned int slot,
+                                   mfn_t gmfn,
+                                   unsigned int root_type,
+                                   mfn_t (*make_shadow)(struct vcpu *v,
+                                                        mfn_t gmfn,
+                                                        uint32_t shadow_type));
 
 /* Update the shadows in response to a pagetable write from Xen */
 int sh_validate_guest_entry(struct vcpu *v, mfn_t gmfn, void *entry, u32 size);
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -549,15 +549,12 @@ void write_ptbase(struct vcpu *v)
  *
  * Update ref counts to shadow tables appropriately.
  */
-void update_cr3(struct vcpu *v)
+pagetable_t update_cr3(struct vcpu *v)
 {
     mfn_t cr3_mfn;
 
     if ( paging_mode_enabled(v->domain) )
-    {
-        paging_update_cr3(v, false);
-        return;
-    }
+        return paging_update_cr3(v, false);
 
     if ( !(v->arch.flags & TF_kernel_mode) )
         cr3_mfn = pagetable_get_mfn(v->arch.guest_table_user);
@@ -565,6 +562,8 @@ void update_cr3(struct vcpu *v)
         cr3_mfn = pagetable_get_mfn(v->arch.guest_table);
 
     make_cr3(v, cr3_mfn);
+
+    return pagetable_null();
 }
 
 static inline void set_tlbflush_timestamp(struct page_info *page)
@@ -3269,6 +3268,7 @@ int new_guest_cr3(mfn_t mfn)
     struct domain *d = curr->domain;
     int rc;
     mfn_t old_base_mfn;
+    pagetable_t old_shadow;
 
     if ( is_pv_32bit_domain(d) )
     {
@@ -3336,9 +3336,22 @@ int new_guest_cr3(mfn_t mfn)
     if ( !VM_ASSIST(d, m2p_strict) )
         fill_ro_mpt(mfn);
     curr->arch.guest_table = pagetable_from_mfn(mfn);
-    update_cr3(curr);
+    old_shadow = update_cr3(curr);
+
+    /*
+     * In shadow mode update_cr3() can fail, in which case here we're still
+     * running on the prior top-level shadow (which we're about to release).
+     * Switch to the idle page tables in such an event; the guest will have
+     * been crashed already.
+     */
+    if ( likely(!mfn_eq(pagetable_get_mfn(old_shadow),
+                        maddr_to_mfn(curr->arch.cr3 & ~X86_CR3_NOFLUSH))) )
+        write_ptbase(curr);
+    else
+        write_ptbase(idle_vcpu[curr->processor]);
 
-    write_ptbase(curr);
+    if ( !pagetable_is_null(old_shadow) )
+        shadow_put_top_level(d, old_shadow);
 
     if ( likely(mfn_x(old_base_mfn) != 0) )
     {
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -424,10 +424,13 @@ bool __init xpti_pcid_enabled(void)
 
 static void _toggle_guest_pt(struct vcpu *v)
 {
+    bool guest_update;
+    pagetable_t old_shadow;
     unsigned long cr3;
 
     v->arch.flags ^= TF_kernel_mode;
-    update_cr3(v);
+    guest_update = v->arch.flags & TF_kernel_mode;
+    old_shadow = update_cr3(v);
 
     /*
      * Don't flush user global mappings from the TLB. Don't tick TLB clock.
@@ -436,13 +439,31 @@ static void _toggle_guest_pt(struct vcpu
      * TLB flush (for just the incoming PCID), as the top level page table may
      * have changed behind our backs. To be on the safe side, suppress the
      * no-flush unconditionally in this case.
+     *
+     * Furthermore in shadow mode update_cr3() can fail, in which case here
+     * we're still running on the prior top-level shadow (which we're about
+     * to release). Switch to the idle page tables in such an event; the
+     * guest will have been crashed already.
      */
     cr3 = v->arch.cr3;
     if ( shadow_mode_enabled(v->domain) )
+    {
         cr3 &= ~X86_CR3_NOFLUSH;
+
+        if ( unlikely(mfn_eq(pagetable_get_mfn(old_shadow),
+                             maddr_to_mfn(cr3))) )
+        {
+            cr3 = idle_vcpu[v->processor]->arch.cr3;
+            /* Also suppress runstate/time area updates below. */
+            guest_update = false;
+        }
+    }
     write_cr3(cr3);
 
-    if ( !(v->arch.flags & TF_kernel_mode) )
+    if ( !pagetable_is_null(old_shadow) )
+        shadow_put_top_level(v->domain, old_shadow);
+
+    if ( !guest_update )
         return;
 
     if ( v->arch.pv.need_update_runstate_area && update_runstate_area(v) )
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/shadow: defer releasing of PV's top-level shadow reference

sh_set_toplevel_shadow() re-pinning the top-level shadow we may be
running on is not enough (and at the same time unnecessary when the
shadow isn't what we're running on): That shadow becomes eligible for
blowing away (from e.g. shadow_prealloc()) immediately after the
paging lock was dropped. Yet it needs to remain valid until the actual
page table switch occurred.

Propagate up the call chain the shadow entry that needs releasing
eventually, and carry out the release immediately after switching page
tables. Handle update_cr3() failures by switching to idle pagetables.
Note that various further uses of update_cr3() are HVM-only or only act
on paused vCPU-s, in which case sh_set_toplevel_shadow() will not defer
releasing of the reference.

While changing the update_cr3() hook, also convert the "do_locking"
parameter to boolean.

This is CVE-2023-34322 / XSA-438.

Reported-by: Tim Deegan <tim@xen.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@cloud.com>

--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -728,10 +728,12 @@ static bool_t hap_invlpg(struct vcpu *v,
     return 1;
 }
 
-static void hap_update_cr3(struct vcpu *v, int do_locking, bool noflush)
+static pagetable_t hap_update_cr3(struct vcpu *v, bool do_locking, bool noflush)
 {
     v->arch.hvm.hw_cr[3] = v->arch.hvm.guest_cr[3];
     hvm_update_guest_cr3(v, noflush);
+
+    return pagetable_null();
 }
 
 /*
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -2641,13 +2641,13 @@ void shadow_update_paging_modes(struct v
 }
 
 /* Set up the top-level shadow and install it in slot 'slot' of shadow_table */
-void sh_set_toplevel_shadow(struct vcpu *v,
-                            unsigned int slot,
-                            mfn_t gmfn,
-                            unsigned int root_type,
-                            mfn_t (*make_shadow)(struct vcpu *v,
-                                                 mfn_t gmfn,
-                                                 uint32_t shadow_type))
+pagetable_t sh_set_toplevel_shadow(struct vcpu *v,
+                                   unsigned int slot,
+                                   mfn_t gmfn,
+                                   unsigned int root_type,
+                                   mfn_t (*make_shadow)(struct vcpu *v,
+                                                        mfn_t gmfn,
+                                                        uint32_t shadow_type))
 {
     mfn_t smfn;
     pagetable_t old_entry, new_entry;
@@ -2704,20 +2704,37 @@ void sh_set_toplevel_shadow(struct vcpu
                   mfn_x(gmfn), mfn_x(pagetable_get_mfn(new_entry)));
     v->arch.paging.shadow.shadow_table[slot] = new_entry;
 
-    /* Decrement the refcount of the old contents of this slot */
-    if ( !pagetable_is_null(old_entry) )
+    /*
+     * Decrement the refcount of the old contents of this slot, unless
+     * we're still running on that shadow - in that case it'll need holding
+     * on to until the actual page table switch did occur.
+     */
+    if ( !pagetable_is_null(old_entry) && (v != current || !is_pv_domain(d)) )
     {
-        mfn_t old_smfn = pagetable_get_mfn(old_entry);
-        /* Need to repin the old toplevel shadow if it's been unpinned
-         * by shadow_prealloc(): in PV mode we're still running on this
-         * shadow and it's not safe to free it yet. */
-        if ( !mfn_to_page(old_smfn)->u.sh.pinned && !sh_pin(d, old_smfn) )
-        {
-            printk(XENLOG_G_ERR "can't re-pin %"PRI_mfn"\n", mfn_x(old_smfn));
-            domain_crash(d);
-        }
-        sh_put_ref(d, old_smfn, 0);
+        sh_put_ref(d, pagetable_get_mfn(old_entry), 0);
+        old_entry = pagetable_null();
     }
+
+    /*
+     * 2- and 3-level shadow mode is used for HVM only. Therefore we never run
+     * on such a shadow, so only call sites requesting an L4 shadow need to pay
+     * attention to the returned value.
+     */
+    ASSERT(pagetable_is_null(old_entry) || root_type == SH_type_l4_64_shadow);
+
+    return old_entry;
+}
+
+/*
+ * Helper invoked when releasing of a top-level shadow's reference was
+ * deferred in sh_set_toplevel_shadow() above.
+ */
+void shadow_put_top_level(struct domain *d, pagetable_t old_entry)
+{
+    ASSERT(!pagetable_is_null(old_entry));
+    paging_lock(d);
+    sh_put_ref(d, pagetable_get_mfn(old_entry), 0);
+    paging_unlock(d);
 }
 
 /**************************************************************************/
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -3604,8 +3604,8 @@ sh_detach_old_tables(struct vcpu *v)
     }
 }
 
-static void
-sh_update_cr3(struct vcpu *v, int do_locking, bool noflush)
+static pagetable_t
+sh_update_cr3(struct vcpu *v, bool do_locking, bool noflush)
 /* Updates vcpu->arch.cr3 after the guest has changed CR3.
  * Paravirtual guests should set v->arch.guest_table (and guest_table_user,
  * if appropriate).
@@ -3619,6 +3619,7 @@ sh_update_cr3(struct vcpu *v, int do_loc
 {
     struct domain *d = v->domain;
     mfn_t gmfn;
+    pagetable_t old_entry = pagetable_null();
 #if GUEST_PAGING_LEVELS == 3 && defined(CONFIG_HVM)
     const guest_l3e_t *gl3e;
     unsigned int i, guest_idx;
@@ -3628,7 +3629,7 @@ sh_update_cr3(struct vcpu *v, int do_loc
     if ( is_pv_domain(d) && !v->is_initialised )
     {
         ASSERT(v->arch.cr3 == 0);
-        return;
+        return old_entry;
     }
 
     if ( do_locking ) paging_lock(v->domain);
@@ -3701,11 +3702,12 @@ sh_update_cr3(struct vcpu *v, int do_loc
 #if GUEST_PAGING_LEVELS == 4
     if ( sh_remove_write_access(d, gmfn, 4, 0) != 0 )
         guest_flush_tlb_mask(d, d->dirty_cpumask);
-    sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l4_shadow, sh_make_shadow);
+    old_entry = sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l4_shadow,
+                                       sh_make_shadow);
     if ( unlikely(pagetable_is_null(v->arch.paging.shadow.shadow_table[0])) )
     {
         ASSERT(d->is_dying || d->is_shutting_down);
-        return;
+        return old_entry;
     }
     if ( !shadow_mode_external(d) && !is_pv_32bit_domain(d) )
     {
@@ -3751,26 +3753,32 @@ sh_update_cr3(struct vcpu *v, int do_loc
                 gl2gfn = guest_l3e_get_gfn(gl3e[i]);
                 gl2mfn = get_gfn_query_unlocked(d, gfn_x(gl2gfn), &p2mt);
                 if ( p2m_is_ram(p2mt) )
-                    sh_set_toplevel_shadow(v, i, gl2mfn, (i == 3)
-                                           ? SH_type_l2h_shadow
-                                           : SH_type_l2_shadow,
-                                           sh_make_shadow);
+                    old_entry = sh_set_toplevel_shadow(v, i, gl2mfn,
+                                                       (i == 3
+                                                        ? SH_type_l2h_shadow
+                                                        : SH_type_l2_shadow),
+                                                       sh_make_shadow);
                 else
-                    sh_set_toplevel_shadow(v, i, INVALID_MFN, 0,
-                                           sh_make_shadow);
+                    old_entry = sh_set_toplevel_shadow(v, i, INVALID_MFN, 0,
+                                                       sh_make_shadow);
             }
             else
-                sh_set_toplevel_shadow(v, i, INVALID_MFN, 0, sh_make_shadow);
+                old_entry = sh_set_toplevel_shadow(v, i, INVALID_MFN, 0,
+                                                   sh_make_shadow);
+
+            ASSERT(pagetable_is_null(old_entry));
         }
     }
 #elif GUEST_PAGING_LEVELS == 2
     if ( sh_remove_write_access(d, gmfn, 2, 0) != 0 )
         guest_flush_tlb_mask(d, d->dirty_cpumask);
-    sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l2_shadow, sh_make_shadow);
+    old_entry = sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l2_shadow,
+                                       sh_make_shadow);
+    ASSERT(pagetable_is_null(old_entry));
     if ( unlikely(pagetable_is_null(v->arch.paging.shadow.shadow_table[0])) )
     {
         ASSERT(d->is_dying || d->is_shutting_down);
-        return;
+        return old_entry;
     }
 #else
 #error This should never happen
@@ -3864,6 +3872,8 @@ sh_update_cr3(struct vcpu *v, int do_loc
 
     /* Release the lock, if we took it (otherwise it's the caller's problem) */
     if ( do_locking ) paging_unlock(v->domain);
+
+    return old_entry;
 }
 
 
--- a/xen/arch/x86/mm/shadow/none.c
+++ b/xen/arch/x86/mm/shadow/none.c
@@ -50,9 +50,10 @@ static unsigned long _gva_to_gfn(struct
     return gfn_x(INVALID_GFN);
 }
 
-static void _update_cr3(struct vcpu *v, int do_locking, bool noflush)
+static pagetable_t _update_cr3(struct vcpu *v, bool do_locking, bool noflush)
 {
     ASSERT_UNREACHABLE();
+    return pagetable_null();
 }
 
 static void _update_paging_modes(struct vcpu *v)
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -360,13 +360,13 @@ mfn_t shadow_alloc(struct domain *d,
 void  shadow_free(struct domain *d, mfn_t smfn);
 
 /* Set up the top-level shadow and install it in slot 'slot' of shadow_table */
-void sh_set_toplevel_shadow(struct vcpu *v,
-                            unsigned int slot,
-                            mfn_t gmfn,
-                            unsigned int root_type,
-                            mfn_t (*make_shadow)(struct vcpu *v,
-                                                 mfn_t gmfn,
-                                                 uint32_t shadow_type));
+pagetable_t sh_set_toplevel_shadow(struct vcpu *v,
+                                   unsigned int slot,
+                                   mfn_t gmfn,
+                                   unsigned int root_type,
+                                   mfn_t (*make_shadow)(struct vcpu *v,
+                                                        mfn_t gmfn,
+                                                        uint32_t shadow_type));
 
 /* Update the shadows in response to a pagetable write from Xen */
 int sh_validate_guest_entry(struct vcpu *v, mfn_t gmfn, void *entry, u32 size);
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -564,15 +564,12 @@ void write_ptbase(struct vcpu *v)
  *
  * Update ref counts to shadow tables appropriately.
  */
-void update_cr3(struct vcpu *v)
+pagetable_t update_cr3(struct vcpu *v)
 {
     mfn_t cr3_mfn;
 
     if ( paging_mode_enabled(v->domain) )
-    {
-        paging_update_cr3(v, false);
-        return;
-    }
+        return paging_update_cr3(v, false);
 
     if ( !(v->arch.flags & TF_kernel_mode) )
         cr3_mfn = pagetable_get_mfn(v->arch.guest_table_user);
@@ -580,6 +577,8 @@ void update_cr3(struct vcpu *v)
         cr3_mfn = pagetable_get_mfn(v->arch.guest_table);
 
     make_cr3(v, cr3_mfn);
+
+    return pagetable_null();
 }
 
 static inline void set_tlbflush_timestamp(struct page_info *page)
@@ -3241,6 +3240,7 @@ int new_guest_cr3(mfn_t mfn)
     struct domain *d = curr->domain;
     int rc;
     mfn_t old_base_mfn;
+    pagetable_t old_shadow;
 
     if ( is_pv_32bit_domain(d) )
     {
@@ -3308,9 +3308,22 @@ int new_guest_cr3(mfn_t mfn)
     if ( !VM_ASSIST(d, m2p_strict) )
         fill_ro_mpt(mfn);
     curr->arch.guest_table = pagetable_from_mfn(mfn);
-    update_cr3(curr);
+    old_shadow = update_cr3(curr);
+
+    /*
+     * In shadow mode update_cr3() can fail, in which case here we're still
+     * running on the prior top-level shadow (which we're about to release).
+     * Switch to the idle page tables in such an event; the guest will have
+     * been crashed already.
+     */
+    if ( likely(!mfn_eq(pagetable_get_mfn(old_shadow),
+                        maddr_to_mfn(curr->arch.cr3 & ~X86_CR3_NOFLUSH))) )
+        write_ptbase(curr);
+    else
+        write_ptbase(idle_vcpu[curr->processor]);
 
-    write_ptbase(curr);
+    if ( !pagetable_is_null(old_shadow) )
+        shadow_put_top_level(d, old_shadow);
 
     if ( likely(mfn_x(old_base_mfn) != 0) )
     {
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -408,10 +408,13 @@ bool __init xpti_pcid_enabled(void)
 
 static void _toggle_guest_pt(struct vcpu *v)
 {
+    bool guest_update;
+    pagetable_t old_shadow;
     unsigned long cr3;
 
     v->arch.flags ^= TF_kernel_mode;
-    update_cr3(v);
+    guest_update = v->arch.flags & TF_kernel_mode;
+    old_shadow = update_cr3(v);
 
     /*
      * Don't flush user global mappings from the TLB. Don't tick TLB clock.
@@ -420,13 +423,31 @@ static void _toggle_guest_pt(struct vcpu
      * TLB flush (for just the incoming PCID), as the top level page table may
      * have changed behind our backs. To be on the safe side, suppress the
      * no-flush unconditionally in this case.
+     *
+     * Furthermore in shadow mode update_cr3() can fail, in which case here
+     * we're still running on the prior top-level shadow (which we're about
+     * to release). Switch to the idle page tables in such an event; the
+     * guest will have been crashed already.
      */
     cr3 = v->arch.cr3;
     if ( shadow_mode_enabled(v->domain) )
+    {
         cr3 &= ~X86_CR3_NOFLUSH;
+
+        if ( unlikely(mfn_eq(pagetable_get_mfn(old_shadow),
+                             maddr_to_mfn(cr3))) )
+        {
+            cr3 = idle_vcpu[v->processor]->arch.cr3;
+            /* Also suppress runstate/time area updates below. */
+            guest_update = false;
+        }
+    }
     write_cr3(cr3);
 
-    if ( !(v->arch.flags & TF_kernel_mode) )
+    if ( !pagetable_is_null(old_shadow) )
+        shadow_put_top_level(v->domain, old_shadow);
+
+    if ( !guest_update )
         return;
 
     if ( v->arch.pv.need_update_runstate_area && update_runstate_area(v) )
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -569,7 +569,7 @@ void audit_domains(void);
 #endif
 
 void make_cr3(struct vcpu *v, mfn_t mfn);
-void update_cr3(struct vcpu *v);
+pagetable_t update_cr3(struct vcpu *v);
 int vcpu_destroy_pagetables(struct vcpu *);
 void *do_page_walk(struct vcpu *v, unsigned long addr);
 
--- a/xen/include/asm-x86/paging.h
+++ b/xen/include/asm-x86/paging.h
@@ -136,7 +136,7 @@ struct paging_mode {
                                             unsigned long cr3,
                                             paddr_t ga, uint32_t *pfec,
                                             unsigned int *page_order);
-    void          (*update_cr3            )(struct vcpu *v, int do_locking,
+    pagetable_t   (*update_cr3            )(struct vcpu *v, bool do_locking,
                                             bool noflush);
     void          (*update_paging_modes   )(struct vcpu *v);
     bool          (*flush_tlb             )(bool (*flush_vcpu)(void *ctxt,
@@ -309,9 +309,9 @@ static inline unsigned long paging_ga_to
 /* Update all the things that are derived from the guest's CR3.
  * Called when the guest changes CR3; the caller can then use v->arch.cr3
  * as the value to load into the host CR3 to schedule this vcpu */
-static inline void paging_update_cr3(struct vcpu *v, bool noflush)
+static inline pagetable_t paging_update_cr3(struct vcpu *v, bool noflush)
 {
-    paging_get_hostmode(v)->update_cr3(v, 1, noflush);
+    return paging_get_hostmode(v)->update_cr3(v, 1, noflush);
 }
 
 /* Update all the things that are derived from the guest's CR0/CR3/CR4.
--- a/xen/include/asm-x86/shadow.h
+++ b/xen/include/asm-x86/shadow.h
@@ -97,6 +97,9 @@ void shadow_blow_tables_per_domain(struc
 int shadow_set_allocation(struct domain *d, unsigned int pages,
                           bool *preempted);
 
+/* Helper to invoke for deferred releasing of a top-level shadow's reference. */
+void shadow_put_top_level(struct domain *d, pagetable_t old);
+
 #else /* !CONFIG_SHADOW_PAGING */
 
 #define shadow_vcpu_teardown(v) ASSERT(is_pv_vcpu(v))
@@ -118,6 +121,11 @@ static inline void shadow_prepare_page_t
 
 static inline void shadow_blow_tables_per_domain(struct domain *d) {}
 
+static inline void shadow_put_top_level(struct domain *d, pagetable_t old)
+{
+    ASSERT_UNREACHABLE();
+}
+
 static inline int shadow_domctl(struct domain *d,
                                 struct xen_domctl_shadow_op *sc,
                                 XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/shadow: defer releasing of PV's top-level shadow reference

sh_set_toplevel_shadow() re-pinning the top-level shadow we may be
running on is not enough (and at the same time unnecessary when the
shadow isn't what we're running on): That shadow becomes eligible for
blowing away (from e.g. shadow_prealloc()) immediately after the
paging lock was dropped. Yet it needs to remain valid until the actual
page table switch occurred.

Propagate up the call chain the shadow entry that needs releasing
eventually, and carry out the release immediately after switching page
tables. Handle update_cr3() failures by switching to idle pagetables.
Note that various further uses of update_cr3() are HVM-only or only act
on paused vCPU-s, in which case sh_set_toplevel_shadow() will not defer
releasing of the reference.

While changing the update_cr3() hook, also convert the "do_locking"
parameter to boolean.

This is CVE-2023-34322 / XSA-438.

Reported-by: Tim Deegan <tim@xen.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@cloud.com>

--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -728,10 +728,12 @@ static bool_t hap_invlpg(struct vcpu *v,
     return 1;
 }
 
-static void hap_update_cr3(struct vcpu *v, int do_locking, bool noflush)
+static pagetable_t hap_update_cr3(struct vcpu *v, bool do_locking, bool noflush)
 {
     v->arch.hvm.hw_cr[3] = v->arch.hvm.guest_cr[3];
     hvm_update_guest_cr3(v, noflush);
+
+    return pagetable_null();
 }
 
 /*
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -2574,13 +2574,13 @@ void shadow_update_paging_modes(struct v
 }
 
 /* Set up the top-level shadow and install it in slot 'slot' of shadow_table */
-void sh_set_toplevel_shadow(struct vcpu *v,
-                            unsigned int slot,
-                            mfn_t gmfn,
-                            unsigned int root_type,
-                            mfn_t (*make_shadow)(struct vcpu *v,
-                                                 mfn_t gmfn,
-                                                 uint32_t shadow_type))
+pagetable_t sh_set_toplevel_shadow(struct vcpu *v,
+                                   unsigned int slot,
+                                   mfn_t gmfn,
+                                   unsigned int root_type,
+                                   mfn_t (*make_shadow)(struct vcpu *v,
+                                                        mfn_t gmfn,
+                                                        uint32_t shadow_type))
 {
     mfn_t smfn;
     pagetable_t old_entry, new_entry;
@@ -2637,20 +2637,37 @@ void sh_set_toplevel_shadow(struct vcpu
                   mfn_x(gmfn), mfn_x(pagetable_get_mfn(new_entry)));
     v->arch.paging.shadow.shadow_table[slot] = new_entry;
 
-    /* Decrement the refcount of the old contents of this slot */
-    if ( !pagetable_is_null(old_entry) )
+    /*
+     * Decrement the refcount of the old contents of this slot, unless
+     * we're still running on that shadow - in that case it'll need holding
+     * on to until the actual page table switch did occur.
+     */
+    if ( !pagetable_is_null(old_entry) && (v != current || !is_pv_domain(d)) )
     {
-        mfn_t old_smfn = pagetable_get_mfn(old_entry);
-        /* Need to repin the old toplevel shadow if it's been unpinned
-         * by shadow_prealloc(): in PV mode we're still running on this
-         * shadow and it's not safe to free it yet. */
-        if ( !mfn_to_page(old_smfn)->u.sh.pinned && !sh_pin(d, old_smfn) )
-        {
-            printk(XENLOG_G_ERR "can't re-pin %"PRI_mfn"\n", mfn_x(old_smfn));
-            domain_crash(d);
-        }
-        sh_put_ref(d, old_smfn, 0);
+        sh_put_ref(d, pagetable_get_mfn(old_entry), 0);
+        old_entry = pagetable_null();
     }
+
+    /*
+     * 2- and 3-level shadow mode is used for HVM only. Therefore we never run
+     * on such a shadow, so only call sites requesting an L4 shadow need to pay
+     * attention to the returned value.
+     */
+    ASSERT(pagetable_is_null(old_entry) || root_type == SH_type_l4_64_shadow);
+
+    return old_entry;
+}
+
+/*
+ * Helper invoked when releasing of a top-level shadow's reference was
+ * deferred in sh_set_toplevel_shadow() above.
+ */
+void shadow_put_top_level(struct domain *d, pagetable_t old_entry)
+{
+    ASSERT(!pagetable_is_null(old_entry));
+    paging_lock(d);
+    sh_put_ref(d, pagetable_get_mfn(old_entry), 0);
+    paging_unlock(d);
 }
 
 /**************************************************************************/
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -3219,8 +3219,8 @@ sh_detach_old_tables(struct vcpu *v)
     }
 }
 
-static void
-sh_update_cr3(struct vcpu *v, int do_locking, bool noflush)
+static pagetable_t
+sh_update_cr3(struct vcpu *v, bool do_locking, bool noflush)
 /* Updates vcpu->arch.cr3 after the guest has changed CR3.
  * Paravirtual guests should set v->arch.guest_table (and guest_table_user,
  * if appropriate).
@@ -3234,6 +3234,7 @@ sh_update_cr3(struct vcpu *v, int do_loc
 {
     struct domain *d = v->domain;
     mfn_t gmfn;
+    pagetable_t old_entry = pagetable_null();
 #if GUEST_PAGING_LEVELS == 3 && defined(CONFIG_HVM)
     const guest_l3e_t *gl3e;
     unsigned int i, guest_idx;
@@ -3243,7 +3244,7 @@ sh_update_cr3(struct vcpu *v, int do_loc
     if ( !is_hvm_domain(d) && !v->is_initialised )
     {
         ASSERT(v->arch.cr3 == 0);
-        return;
+        return old_entry;
     }
 
     if ( do_locking ) paging_lock(v->domain);
@@ -3316,11 +3317,12 @@ sh_update_cr3(struct vcpu *v, int do_loc
 #if GUEST_PAGING_LEVELS == 4
     if ( sh_remove_write_access(d, gmfn, 4, 0) != 0 )
         guest_flush_tlb_mask(d, d->dirty_cpumask);
-    sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l4_shadow, sh_make_shadow);
+    old_entry = sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l4_shadow,
+                                       sh_make_shadow);
     if ( unlikely(pagetable_is_null(v->arch.paging.shadow.shadow_table[0])) )
     {
         ASSERT(d->is_dying || d->is_shutting_down);
-        return;
+        return old_entry;
     }
     if ( !shadow_mode_external(d) && !is_pv_32bit_domain(d) )
     {
@@ -3366,24 +3368,30 @@ sh_update_cr3(struct vcpu *v, int do_loc
                 gl2gfn = guest_l3e_get_gfn(gl3e[i]);
                 gl2mfn = get_gfn_query_unlocked(d, gfn_x(gl2gfn), &p2mt);
                 if ( p2m_is_ram(p2mt) )
-                    sh_set_toplevel_shadow(v, i, gl2mfn, SH_type_l2_shadow,
-                                           sh_make_shadow);
+                    old_entry = sh_set_toplevel_shadow(v, i, gl2mfn,
+                                                       SH_type_l2_shadow,
+                                                       sh_make_shadow);
                 else
-                    sh_set_toplevel_shadow(v, i, INVALID_MFN, 0,
-                                           sh_make_shadow);
+                    old_entry = sh_set_toplevel_shadow(v, i, INVALID_MFN, 0,
+                                                       sh_make_shadow);
             }
             else
-                sh_set_toplevel_shadow(v, i, INVALID_MFN, 0, sh_make_shadow);
+                old_entry = sh_set_toplevel_shadow(v, i, INVALID_MFN, 0,
+                                                   sh_make_shadow);
+
+            ASSERT(pagetable_is_null(old_entry));
         }
     }
 #elif GUEST_PAGING_LEVELS == 2
     if ( sh_remove_write_access(d, gmfn, 2, 0) != 0 )
         guest_flush_tlb_mask(d, d->dirty_cpumask);
-    sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l2_shadow, sh_make_shadow);
+    old_entry = sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l2_shadow,
+                                       sh_make_shadow);
+    ASSERT(pagetable_is_null(old_entry));
     if ( unlikely(pagetable_is_null(v->arch.paging.shadow.shadow_table[0])) )
     {
         ASSERT(d->is_dying || d->is_shutting_down);
-        return;
+        return old_entry;
     }
 #else
 #error This should never happen
@@ -3477,6 +3485,8 @@ sh_update_cr3(struct vcpu *v, int do_loc
 
     /* Release the lock, if we took it (otherwise it's the caller's problem) */
     if ( do_locking ) paging_unlock(v->domain);
+
+    return old_entry;
 }
 
 
--- a/xen/arch/x86/mm/shadow/none.c
+++ b/xen/arch/x86/mm/shadow/none.c
@@ -52,9 +52,10 @@ static unsigned long _gva_to_gfn(struct
 }
 #endif
 
-static void _update_cr3(struct vcpu *v, int do_locking, bool noflush)
+static pagetable_t _update_cr3(struct vcpu *v, bool do_locking, bool noflush)
 {
     ASSERT_UNREACHABLE();
+    return pagetable_null();
 }
 
 static void _update_paging_modes(struct vcpu *v)
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -391,13 +391,13 @@ mfn_t shadow_alloc(struct domain *d,
 void  shadow_free(struct domain *d, mfn_t smfn);
 
 /* Set up the top-level shadow and install it in slot 'slot' of shadow_table */
-void sh_set_toplevel_shadow(struct vcpu *v,
-                            unsigned int slot,
-                            mfn_t gmfn,
-                            unsigned int root_type,
-                            mfn_t (*make_shadow)(struct vcpu *v,
-                                                 mfn_t gmfn,
-                                                 uint32_t shadow_type));
+pagetable_t sh_set_toplevel_shadow(struct vcpu *v,
+                                   unsigned int slot,
+                                   mfn_t gmfn,
+                                   unsigned int root_type,
+                                   mfn_t (*make_shadow)(struct vcpu *v,
+                                                        mfn_t gmfn,
+                                                        uint32_t shadow_type));
 
 /* Update the shadows in response to a pagetable write from Xen */
 int sh_validate_guest_entry(struct vcpu *v, mfn_t gmfn, void *entry, u32 size);
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -565,15 +565,12 @@ void write_ptbase(struct vcpu *v)
  *
  * Update ref counts to shadow tables appropriately.
  */
-void update_cr3(struct vcpu *v)
+pagetable_t update_cr3(struct vcpu *v)
 {
     mfn_t cr3_mfn;
 
     if ( paging_mode_enabled(v->domain) )
-    {
-        paging_update_cr3(v, false);
-        return;
-    }
+        return paging_update_cr3(v, false);
 
     if ( !(v->arch.flags & TF_kernel_mode) )
         cr3_mfn = pagetable_get_mfn(v->arch.guest_table_user);
@@ -581,6 +578,8 @@ void update_cr3(struct vcpu *v)
         cr3_mfn = pagetable_get_mfn(v->arch.guest_table);
 
     make_cr3(v, cr3_mfn);
+
+    return pagetable_null();
 }
 
 static inline void set_tlbflush_timestamp(struct page_info *page)
@@ -3254,6 +3253,7 @@ int new_guest_cr3(mfn_t mfn)
     struct domain *d = curr->domain;
     int rc;
     mfn_t old_base_mfn;
+    pagetable_t old_shadow;
 
     if ( is_pv_32bit_domain(d) )
     {
@@ -3321,9 +3321,22 @@ int new_guest_cr3(mfn_t mfn)
     if ( !VM_ASSIST(d, m2p_strict) )
         fill_ro_mpt(mfn);
     curr->arch.guest_table = pagetable_from_mfn(mfn);
-    update_cr3(curr);
+    old_shadow = update_cr3(curr);
+
+    /*
+     * In shadow mode update_cr3() can fail, in which case here we're still
+     * running on the prior top-level shadow (which we're about to release).
+     * Switch to the idle page tables in such an event; the guest will have
+     * been crashed already.
+     */
+    if ( likely(!mfn_eq(pagetable_get_mfn(old_shadow),
+                        maddr_to_mfn(curr->arch.cr3 & ~X86_CR3_NOFLUSH))) )
+        write_ptbase(curr);
+    else
+        write_ptbase(idle_vcpu[curr->processor]);
 
-    write_ptbase(curr);
+    if ( !pagetable_is_null(old_shadow) )
+        shadow_put_top_level(d, old_shadow);
 
     if ( likely(mfn_x(old_base_mfn) != 0) )
     {
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -424,10 +424,13 @@ bool __init xpti_pcid_enabled(void)
 
 static void _toggle_guest_pt(struct vcpu *v)
 {
+    bool guest_update;
+    pagetable_t old_shadow;
     unsigned long cr3;
 
     v->arch.flags ^= TF_kernel_mode;
-    update_cr3(v);
+    guest_update = v->arch.flags & TF_kernel_mode;
+    old_shadow = update_cr3(v);
 
     /*
      * Don't flush user global mappings from the TLB. Don't tick TLB clock.
@@ -436,13 +439,31 @@ static void _toggle_guest_pt(struct vcpu
      * TLB flush (for just the incoming PCID), as the top level page table may
      * have changed behind our backs. To be on the safe side, suppress the
      * no-flush unconditionally in this case.
+     *
+     * Furthermore in shadow mode update_cr3() can fail, in which case here
+     * we're still running on the prior top-level shadow (which we're about
+     * to release). Switch to the idle page tables in such an event; the
+     * guest will have been crashed already.
      */
     cr3 = v->arch.cr3;
     if ( shadow_mode_enabled(v->domain) )
+    {
         cr3 &= ~X86_CR3_NOFLUSH;
+
+        if ( unlikely(mfn_eq(pagetable_get_mfn(old_shadow),
+                             maddr_to_mfn(cr3))) )
+        {
+            cr3 = idle_vcpu[v->processor]->arch.cr3;
+            /* Also suppress runstate/time area updates below. */
+            guest_update = false;
+        }
+    }
     write_cr3(cr3);
 
-    if ( !(v->arch.flags & TF_kernel_mode) )
+    if ( !pagetable_is_null(old_shadow) )
+        shadow_put_top_level(v->domain, old_shadow);
+
+    if ( !guest_update )
         return;
 
     if ( v->arch.pv.need_update_runstate_area && update_runstate_area(v) )
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -564,7 +564,7 @@ void audit_domains(void);
 #endif
 
 void make_cr3(struct vcpu *v, mfn_t mfn);
-void update_cr3(struct vcpu *v);
+pagetable_t update_cr3(struct vcpu *v);
 int vcpu_destroy_pagetables(struct vcpu *);
 void *do_page_walk(struct vcpu *v, unsigned long addr);
 
--- a/xen/include/asm-x86/paging.h
+++ b/xen/include/asm-x86/paging.h
@@ -138,7 +138,7 @@ struct paging_mode {
                                             paddr_t ga, uint32_t *pfec,
                                             unsigned int *page_order);
 #endif
-    void          (*update_cr3            )(struct vcpu *v, int do_locking,
+    pagetable_t   (*update_cr3            )(struct vcpu *v, bool do_locking,
                                             bool noflush);
     void          (*update_paging_modes   )(struct vcpu *v);
     bool          (*flush_tlb             )(bool (*flush_vcpu)(void *ctxt,
@@ -315,9 +315,9 @@ static inline unsigned long paging_ga_to
 /* Update all the things that are derived from the guest's CR3.
  * Called when the guest changes CR3; the caller can then use v->arch.cr3
  * as the value to load into the host CR3 to schedule this vcpu */
-static inline void paging_update_cr3(struct vcpu *v, bool noflush)
+static inline pagetable_t paging_update_cr3(struct vcpu *v, bool noflush)
 {
-    paging_get_hostmode(v)->update_cr3(v, 1, noflush);
+    return paging_get_hostmode(v)->update_cr3(v, 1, noflush);
 }
 
 /* Update all the things that are derived from the guest's CR0/CR3/CR4.
--- a/xen/include/asm-x86/shadow.h
+++ b/xen/include/asm-x86/shadow.h
@@ -97,6 +97,9 @@ void shadow_blow_tables_per_domain(struc
 int shadow_set_allocation(struct domain *d, unsigned int pages,
                           bool *preempted);
 
+/* Helper to invoke for deferred releasing of a top-level shadow's reference. */
+void shadow_put_top_level(struct domain *d, pagetable_t old);
+
 #else /* !CONFIG_SHADOW_PAGING */
 
 #define shadow_vcpu_teardown(v) ASSERT(is_pv_vcpu(v))
@@ -118,6 +121,11 @@ static inline void shadow_prepare_page_t
 
 static inline void shadow_blow_tables_per_domain(struct domain *d) {}
 
+static inline void shadow_put_top_level(struct domain *d, pagetable_t old)
+{
+    ASSERT_UNREACHABLE();
+}
+
 static inline int shadow_domctl(struct domain *d,
                                 struct xen_domctl_shadow_op *sc,
                                 XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
From: Jan Beulich <jbeulich@suse.com>
Subject: x86/shadow: defer releasing of PV's top-level shadow reference

sh_set_toplevel_shadow() re-pinning the top-level shadow we may be
running on is not enough (and at the same time unnecessary when the
shadow isn't what we're running on): That shadow becomes eligible for
blowing away (from e.g. shadow_prealloc()) immediately after the
paging lock was dropped. Yet it needs to remain valid until the actual
page table switch occurred.

Propagate up the call chain the shadow entry that needs releasing
eventually, and carry out the release immediately after switching page
tables. Handle update_cr3() failures by switching to idle pagetables.
Note that various further uses of update_cr3() are HVM-only or only act
on paused vCPU-s, in which case sh_set_toplevel_shadow() will not defer
releasing of the reference.

While changing the update_cr3() hook, also convert the "do_locking"
parameter to boolean.

This is CVE-2023-34322 / XSA-438.

Reported-by: Tim Deegan <tim@xen.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@cloud.com>

--- a/xen/arch/x86/include/asm/mm.h
+++ b/xen/arch/x86/include/asm/mm.h
@@ -552,7 +552,7 @@ void audit_domains(void);
 #endif
 
 void make_cr3(struct vcpu *v, mfn_t mfn);
-void update_cr3(struct vcpu *v);
+pagetable_t update_cr3(struct vcpu *v);
 int vcpu_destroy_pagetables(struct vcpu *);
 void *do_page_walk(struct vcpu *v, unsigned long addr);
 
--- a/xen/arch/x86/include/asm/paging.h
+++ b/xen/arch/x86/include/asm/paging.h
@@ -138,7 +138,7 @@ struct paging_mode {
                                             paddr_t ga, uint32_t *pfec,
                                             unsigned int *page_order);
 #endif
-    void          (*update_cr3            )(struct vcpu *v, int do_locking,
+    pagetable_t   (*update_cr3            )(struct vcpu *v, bool do_locking,
                                             bool noflush);
     void          (*update_paging_modes   )(struct vcpu *v);
     bool          (*flush_tlb             )(const unsigned long *vcpu_bitmap);
@@ -310,9 +310,9 @@ static inline unsigned long paging_ga_to
 /* Update all the things that are derived from the guest's CR3.
  * Called when the guest changes CR3; the caller can then use v->arch.cr3
  * as the value to load into the host CR3 to schedule this vcpu */
-static inline void paging_update_cr3(struct vcpu *v, bool noflush)
+static inline pagetable_t paging_update_cr3(struct vcpu *v, bool noflush)
 {
-    paging_get_hostmode(v)->update_cr3(v, 1, noflush);
+    return paging_get_hostmode(v)->update_cr3(v, 1, noflush);
 }
 
 /* Update all the things that are derived from the guest's CR0/CR3/CR4.
--- a/xen/arch/x86/include/asm/shadow.h
+++ b/xen/arch/x86/include/asm/shadow.h
@@ -99,6 +99,9 @@ int shadow_set_allocation(struct domain
 
 int shadow_get_allocation_bytes(struct domain *d, uint64_t *size);
 
+/* Helper to invoke for deferred releasing of a top-level shadow's reference. */
+void shadow_put_top_level(struct domain *d, pagetable_t old);
+
 #else /* !CONFIG_SHADOW_PAGING */
 
 #define shadow_vcpu_teardown(v) ASSERT(is_pv_vcpu(v))
@@ -121,6 +124,11 @@ static inline void shadow_prepare_page_t
 
 static inline void shadow_blow_tables_per_domain(struct domain *d) {}
 
+static inline void shadow_put_top_level(struct domain *d, pagetable_t old)
+{
+    ASSERT_UNREACHABLE();
+}
+
 static inline int shadow_domctl(struct domain *d,
                                 struct xen_domctl_shadow_op *sc,
                                 XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -739,11 +739,13 @@ static bool cf_check hap_invlpg(struct v
     return 1;
 }
 
-static void cf_check hap_update_cr3(
-    struct vcpu *v, int do_locking, bool noflush)
+static pagetable_t cf_check hap_update_cr3(
+    struct vcpu *v, bool do_locking, bool noflush)
 {
     v->arch.hvm.hw_cr[3] = v->arch.hvm.guest_cr[3];
     hvm_update_guest_cr3(v, noflush);
+
+    return pagetable_null();
 }
 
 static bool flush_vcpu(const struct vcpu *v, const unsigned long *vcpu_bitmap)
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -2590,13 +2590,13 @@ void cf_check shadow_update_paging_modes
 }
 
 /* Set up the top-level shadow and install it in slot 'slot' of shadow_table */
-void sh_set_toplevel_shadow(struct vcpu *v,
-                            unsigned int slot,
-                            mfn_t gmfn,
-                            unsigned int root_type,
-                            mfn_t (*make_shadow)(struct vcpu *v,
-                                                 mfn_t gmfn,
-                                                 uint32_t shadow_type))
+pagetable_t sh_set_toplevel_shadow(struct vcpu *v,
+                                   unsigned int slot,
+                                   mfn_t gmfn,
+                                   unsigned int root_type,
+                                   mfn_t (*make_shadow)(struct vcpu *v,
+                                                        mfn_t gmfn,
+                                                        uint32_t shadow_type))
 {
     mfn_t smfn;
     pagetable_t old_entry, new_entry;
@@ -2653,20 +2653,37 @@ void sh_set_toplevel_shadow(struct vcpu
                   mfn_x(gmfn), mfn_x(pagetable_get_mfn(new_entry)));
     v->arch.paging.shadow.shadow_table[slot] = new_entry;
 
-    /* Decrement the refcount of the old contents of this slot */
-    if ( !pagetable_is_null(old_entry) )
+    /*
+     * Decrement the refcount of the old contents of this slot, unless
+     * we're still running on that shadow - in that case it'll need holding
+     * on to until the actual page table switch did occur.
+     */
+    if ( !pagetable_is_null(old_entry) && (v != current || !is_pv_domain(d)) )
     {
-        mfn_t old_smfn = pagetable_get_mfn(old_entry);
-        /* Need to repin the old toplevel shadow if it's been unpinned
-         * by shadow_prealloc(): in PV mode we're still running on this
-         * shadow and it's not safe to free it yet. */
-        if ( !mfn_to_page(old_smfn)->u.sh.pinned && !sh_pin(d, old_smfn) )
-        {
-            printk(XENLOG_G_ERR "can't re-pin %"PRI_mfn"\n", mfn_x(old_smfn));
-            domain_crash(d);
-        }
-        sh_put_ref(d, old_smfn, 0);
+        sh_put_ref(d, pagetable_get_mfn(old_entry), 0);
+        old_entry = pagetable_null();
     }
+
+    /*
+     * 2- and 3-level shadow mode is used for HVM only. Therefore we never run
+     * on such a shadow, so only call sites requesting an L4 shadow need to pay
+     * attention to the returned value.
+     */
+    ASSERT(pagetable_is_null(old_entry) || root_type == SH_type_l4_64_shadow);
+
+    return old_entry;
+}
+
+/*
+ * Helper invoked when releasing of a top-level shadow's reference was
+ * deferred in sh_set_toplevel_shadow() above.
+ */
+void shadow_put_top_level(struct domain *d, pagetable_t old_entry)
+{
+    ASSERT(!pagetable_is_null(old_entry));
+    paging_lock(d);
+    sh_put_ref(d, pagetable_get_mfn(old_entry), 0);
+    paging_unlock(d);
 }
 
 /**************************************************************************/
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -3224,7 +3224,8 @@ static void cf_check sh_detach_old_table
     }
 }
 
-static void cf_check sh_update_cr3(struct vcpu *v, int do_locking, bool noflush)
+static pagetable_t cf_check sh_update_cr3(struct vcpu *v, bool do_locking,
+                                          bool noflush)
 /* Updates vcpu->arch.cr3 after the guest has changed CR3.
  * Paravirtual guests should set v->arch.guest_table (and guest_table_user,
  * if appropriate).
@@ -3238,6 +3239,7 @@ static void cf_check sh_update_cr3(struc
 {
     struct domain *d = v->domain;
     mfn_t gmfn;
+    pagetable_t old_entry = pagetable_null();
 #if GUEST_PAGING_LEVELS == 3
     const guest_l3e_t *gl3e;
     unsigned int i, guest_idx;
@@ -3247,7 +3249,7 @@ static void cf_check sh_update_cr3(struc
     if ( !is_hvm_domain(d) && !v->is_initialised )
     {
         ASSERT(v->arch.cr3 == 0);
-        return;
+        return old_entry;
     }
 
     if ( do_locking ) paging_lock(v->domain);
@@ -3320,11 +3322,12 @@ static void cf_check sh_update_cr3(struc
 #if GUEST_PAGING_LEVELS == 4
     if ( sh_remove_write_access(d, gmfn, 4, 0) != 0 )
         guest_flush_tlb_mask(d, d->dirty_cpumask);
-    sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l4_shadow, sh_make_shadow);
+    old_entry = sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l4_shadow,
+                                       sh_make_shadow);
     if ( unlikely(pagetable_is_null(v->arch.paging.shadow.shadow_table[0])) )
     {
         ASSERT(d->is_dying || d->is_shutting_down);
-        return;
+        return old_entry;
     }
     if ( !shadow_mode_external(d) && !is_pv_32bit_domain(d) )
     {
@@ -3368,24 +3371,30 @@ static void cf_check sh_update_cr3(struc
                 gl2gfn = guest_l3e_get_gfn(gl3e[i]);
                 gl2mfn = get_gfn_query_unlocked(d, gfn_x(gl2gfn), &p2mt);
                 if ( p2m_is_ram(p2mt) )
-                    sh_set_toplevel_shadow(v, i, gl2mfn, SH_type_l2_shadow,
-                                           sh_make_shadow);
+                    old_entry = sh_set_toplevel_shadow(v, i, gl2mfn,
+                                                       SH_type_l2_shadow,
+                                                       sh_make_shadow);
                 else
-                    sh_set_toplevel_shadow(v, i, INVALID_MFN, 0,
-                                           sh_make_shadow);
+                    old_entry = sh_set_toplevel_shadow(v, i, INVALID_MFN, 0,
+                                                       sh_make_shadow);
             }
             else
-                sh_set_toplevel_shadow(v, i, INVALID_MFN, 0, sh_make_shadow);
+                old_entry = sh_set_toplevel_shadow(v, i, INVALID_MFN, 0,
+                                                   sh_make_shadow);
+
+            ASSERT(pagetable_is_null(old_entry));
         }
     }
 #elif GUEST_PAGING_LEVELS == 2
     if ( sh_remove_write_access(d, gmfn, 2, 0) != 0 )
         guest_flush_tlb_mask(d, d->dirty_cpumask);
-    sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l2_shadow, sh_make_shadow);
+    old_entry = sh_set_toplevel_shadow(v, 0, gmfn, SH_type_l2_shadow,
+                                       sh_make_shadow);
+    ASSERT(pagetable_is_null(old_entry));
     if ( unlikely(pagetable_is_null(v->arch.paging.shadow.shadow_table[0])) )
     {
         ASSERT(d->is_dying || d->is_shutting_down);
-        return;
+        return old_entry;
     }
 #else
 #error This should never happen
@@ -3473,6 +3482,8 @@ static void cf_check sh_update_cr3(struc
 
     /* Release the lock, if we took it (otherwise it's the caller's problem) */
     if ( do_locking ) paging_unlock(v->domain);
+
+    return old_entry;
 }
 
 
--- a/xen/arch/x86/mm/shadow/none.c
+++ b/xen/arch/x86/mm/shadow/none.c
@@ -52,9 +52,11 @@ static unsigned long cf_check _gva_to_gf
 }
 #endif
 
-static void cf_check _update_cr3(struct vcpu *v, int do_locking, bool noflush)
+static pagetable_t cf_check _update_cr3(struct vcpu *v, bool do_locking,
+                                        bool noflush)
 {
     ASSERT_UNREACHABLE();
+    return pagetable_null();
 }
 
 static void cf_check _update_paging_modes(struct vcpu *v)
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -391,13 +391,13 @@ mfn_t shadow_alloc(struct domain *d,
 void  shadow_free(struct domain *d, mfn_t smfn);
 
 /* Set up the top-level shadow and install it in slot 'slot' of shadow_table */
-void sh_set_toplevel_shadow(struct vcpu *v,
-                            unsigned int slot,
-                            mfn_t gmfn,
-                            unsigned int root_type,
-                            mfn_t (*make_shadow)(struct vcpu *v,
-                                                 mfn_t gmfn,
-                                                 uint32_t shadow_type));
+pagetable_t sh_set_toplevel_shadow(struct vcpu *v,
+                                   unsigned int slot,
+                                   mfn_t gmfn,
+                                   unsigned int root_type,
+                                   mfn_t (*make_shadow)(struct vcpu *v,
+                                                        mfn_t gmfn,
+                                                        uint32_t shadow_type));
 
 /* Update the shadows in response to a pagetable write from Xen */
 int sh_validate_guest_entry(struct vcpu *v, mfn_t gmfn, void *entry, u32 size);
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -567,15 +567,12 @@ void write_ptbase(struct vcpu *v)
  *
  * Update ref counts to shadow tables appropriately.
  */
-void update_cr3(struct vcpu *v)
+pagetable_t update_cr3(struct vcpu *v)
 {
     mfn_t cr3_mfn;
 
     if ( paging_mode_enabled(v->domain) )
-    {
-        paging_update_cr3(v, false);
-        return;
-    }
+        return paging_update_cr3(v, false);
 
     if ( !(v->arch.flags & TF_kernel_mode) )
         cr3_mfn = pagetable_get_mfn(v->arch.guest_table_user);
@@ -583,6 +580,8 @@ void update_cr3(struct vcpu *v)
         cr3_mfn = pagetable_get_mfn(v->arch.guest_table);
 
     make_cr3(v, cr3_mfn);
+
+    return pagetable_null();
 }
 
 static inline void set_tlbflush_timestamp(struct page_info *page)
@@ -3285,6 +3284,7 @@ int new_guest_cr3(mfn_t mfn)
     struct domain *d = curr->domain;
     int rc;
     mfn_t old_base_mfn;
+    pagetable_t old_shadow;
 
     if ( is_pv_32bit_domain(d) )
     {
@@ -3352,9 +3352,22 @@ int new_guest_cr3(mfn_t mfn)
     if ( !VM_ASSIST(d, m2p_strict) )
         fill_ro_mpt(mfn);
     curr->arch.guest_table = pagetable_from_mfn(mfn);
-    update_cr3(curr);
+    old_shadow = update_cr3(curr);
+
+    /*
+     * In shadow mode update_cr3() can fail, in which case here we're still
+     * running on the prior top-level shadow (which we're about to release).
+     * Switch to the idle page tables in such an event; the guest will have
+     * been crashed already.
+     */
+    if ( likely(!mfn_eq(pagetable_get_mfn(old_shadow),
+                        maddr_to_mfn(curr->arch.cr3 & ~X86_CR3_NOFLUSH))) )
+        write_ptbase(curr);
+    else
+        write_ptbase(idle_vcpu[curr->processor]);
 
-    write_ptbase(curr);
+    if ( !pagetable_is_null(old_shadow) )
+        shadow_put_top_level(d, old_shadow);
 
     if ( likely(mfn_x(old_base_mfn) != 0) )
     {
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -424,10 +424,13 @@ bool __init xpti_pcid_enabled(void)
 
 static void _toggle_guest_pt(struct vcpu *v)
 {
+    bool guest_update;
+    pagetable_t old_shadow;
     unsigned long cr3;
 
     v->arch.flags ^= TF_kernel_mode;
-    update_cr3(v);
+    guest_update = v->arch.flags & TF_kernel_mode;
+    old_shadow = update_cr3(v);
 
     /*
      * Don't flush user global mappings from the TLB. Don't tick TLB clock.
@@ -436,13 +439,31 @@ static void _toggle_guest_pt(struct vcpu
      * TLB flush (for just the incoming PCID), as the top level page table may
      * have changed behind our backs. To be on the safe side, suppress the
      * no-flush unconditionally in this case.
+     *
+     * Furthermore in shadow mode update_cr3() can fail, in which case here
+     * we're still running on the prior top-level shadow (which we're about
+     * to release). Switch to the idle page tables in such an event; the
+     * guest will have been crashed already.
      */
     cr3 = v->arch.cr3;
     if ( shadow_mode_enabled(v->domain) )
+    {
         cr3 &= ~X86_CR3_NOFLUSH;
+
+        if ( unlikely(mfn_eq(pagetable_get_mfn(old_shadow),
+                             maddr_to_mfn(cr3))) )
+        {
+            cr3 = idle_vcpu[v->processor]->arch.cr3;
+            /* Also suppress runstate/time area updates below. */
+            guest_update = false;
+        }
+    }
     write_cr3(cr3);
 
-    if ( !(v->arch.flags & TF_kernel_mode) )
+    if ( !pagetable_is_null(old_shadow) )
+        shadow_put_top_level(v->domain, old_shadow);
+
+    if ( !guest_update )
         return;
 
     if ( v->arch.pv.need_update_runstate_area && update_runstate_area(v) )