Xen Security Advisory 494 v3 (CVE-2026-42488) - x86: mismatched mapcache metadata

Posted by Xen.org security team 3 days, 18 hours ago
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

            Xen Security Advisory CVE-2026-42488 / XSA-494
                               version 3

                   x86: mismatched mapcache metadata

UPDATES IN VERSION 3
====================

Public release.

ISSUE DESCRIPTION
=================

Some shadow paging error paths switch the page-tables without
updating the currently running vCPU reference.  This causes a mismatch
between the loaded page-tables and the mapcache metadata which can lead
to corruption of the mapcache.

IMPACT
======

Privilege escalation, Denial of Service (DoS) affecting the entire host,
and information leaks.

VULNERABLE SYSTEMS
==================

Xen 4.15 and onwards are vulnerable.  Any Xen version with the fix for
XSA-438 applied is vulnerable.

Only x86 systems are vulnerable.  Only 64-bit PV guests can leverage the
vulnerability, and only when running in shadow mode.  Shadow mode would
be in use when migrating guests or as a workaround for XSA-273 (L1TF).

MITIGATION
==========

Running only HVM or PVH guests will avoid the vulnerability.

Running PV guests in the PV shim will also avoid the vulnerability.

CREDITS
=======

This issue was discovered by Roger Pau Monné of XenServer.

RESOLUTION
==========

Applying the attached patch resolves this issue.

Note that patches for released versions are generally prepared to
apply to the stable branches, and may not apply cleanly to the most
recent release tarball.  Downstreams are encouraged to update to the
tip of the stable branch before applying these patches.

xsa494.patch           xen-unstable
xsa494-4.21.patch      Xen 4.21.x
xsa494-4.20.patch      Xen 4.20.x - Xen 4.19.x
xsa494-4.18.patch      Xen 4.18.x
xsa494-4.17.patch      Xen 4.17.x

$ sha256sum xsa494*
6e3328f73000afdfffa5e4d9fec89a4c9456d97758bfa1a0605765a386565328  xsa494.patch
483675d6cb69b70e919110f58814b047787c3b53def344cf32f4acdd7ee9b271  xsa494-4.17.patch
e637dce8cd5ecf7c30501ab2eb0af5240ff0a36844b257ca7dd14094d5118aa2  xsa494-4.18.patch
a70aa60fb5dcf171025c5d90e332dcae95a83bbf9d42ab45451f629621f455e5  xsa494-4.20.patch
14f9698060c523893f710cc5ab3ec723c75a99e5caa193b9281d4a06016bf687  xsa494-4.21.patch
$

DEPLOYMENT DURING EMBARGO
=========================

Deployment of the patches and/or mitigations described above (or
others which are substantially similar) is permitted during the
embargo, even on public-facing systems with untrusted guest users and
administrators.

But: Distribution of updated software is prohibited (except to other
members of the predisclosure list).

Predisclosure list members who wish to deploy significantly different
patches and/or mitigations, please contact the Xen Project Security
Team.

(Note: this during-embargo deployment notice is retained in
post-embargo publicly released Xen Project advisories, even though it
is then no longer applicable.  This is to enable the community to have
oversight of the Xen Project Security Team's decisionmaking.)

For more information about permissible uses of embargoed information,
consult the Xen Project community's agreed Security Policy:
  http://www.xenproject.org/security-policy.html
-----BEGIN PGP SIGNATURE-----

iQFABAEBCAAqFiEEI+MiLBRfRHX6gGCng/4UyVfoK9kFAmon+5MMHHBncEB4ZW4u
b3JnAAoJEIP+FMlX6CvZ+HcIAIpJbk3ISxjsn0ZFBXR01iOGubj+Y/vKE4mdJe1y
1//aeWPL26enDoyZ5KoT+hiC2qogTfT1p71MIS0Gns44UfVOw95xlrd0eUO//5td
NQk7YFYn/WB+z9KWcdV8+Lo3zKiMNFiILCeK2+WefByfBQfZ/WFBQ48WZpxnkxHo
j7cgtmtmTStmIDEWxY0pfdEWHPCBGX3SvUGWKR2tl5tZZxjd+yIij4fjLzUCKxU3
r4dYblTAg0JyDsI2SR16TLRSKyWxnwprzlb2fJEDsZXoZvIetf6jhHpvfFY+Z2m1
zlLfFDam+oGQI1CwrMNCz69AaeJzyTnRdiY+BM51lpgdjj0=
=pLmw
-----END PGP SIGNATURE-----
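
The mismatch described in the advisory can be illustrated with a small
user-space model.  This is only a sketch under stated assumptions:
struct vcpu_model, struct cpu_model and every function below are
hypothetical stand-ins and not Xen's real definitions, and the pre-fix
selection is reduced to "trust current", which omits the extra checks
the real code performed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a vCPU: just a root page-table address. */
struct vcpu_model {
    unsigned long cr3;
};

/* Hypothetical per-CPU state (not Xen's real layout). */
struct cpu_model {
    const struct vcpu_model *current;      /* vCPU the scheduler sees as running */
    const struct vcpu_model *pgtable_vcpu; /* vCPU whose page-tables are loaded */
    unsigned long loaded_cr3;              /* models the hardware cr3 register */
};

/* Switch page-tables and record their owner in the same step. */
static void model_switch_cr3(struct cpu_model *cpu, const struct vcpu_model *v)
{
    cpu->loaded_cr3 = v->cr3;
    cpu->pgtable_vcpu = v;
}

/* Simplified pre-fix choice: assume the loaded page-tables belong to current. */
static const struct vcpu_model *owner_from_current(const struct cpu_model *cpu)
{
    return cpu->current;
}

/* Post-fix choice: use the field that is updated together with cr3. */
static const struct vcpu_model *owner_from_tracking(const struct cpu_model *cpu)
{
    return cpu->pgtable_vcpu;
}

/* True when the chosen mapcache owner matches the loaded page-tables. */
static bool owner_matches(const struct cpu_model *cpu,
                          const struct vcpu_model *owner)
{
    return owner != NULL && owner->cr3 == cpu->loaded_cr3;
}

/* Model a shadow error path: fall back to idle page-tables, keep current. */
static void model_error_path(struct cpu_model *cpu,
                             const struct vcpu_model *idle)
{
    model_switch_cr3(cpu, idle);
}
```

In this model, after model_error_path() runs with current still pointing
at the guest, owner_from_current() names a vCPU whose page-tables are no
longer loaded, while owner_from_tracking() still agrees with loaded_cr3.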
From 69c43e88a7738a7f85f2c6a18f93d897c39eb6f6 Mon Sep 17 00:00:00 2001
From: Roger Pau Monne <roger.pau@citrix.com>
Date: Mon, 16 Mar 2026 11:03:22 +0100
Subject: [PATCH] x86/mm: accurately track which vCPU page-tables are loaded
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Neither current nor curr_vcpu per-CPU fields accurately track which
page-tables are loaded.  There are corner cases when dealing with shadow
paging failures that switch to the idle vCPU page-tables without changing
current or curr_vcpu per-CPU fields.

Introduce a new per-CPU field that attempts to track which vCPU page-tables
are loaded.  Update such tracking when cr3 is changed, and do so in a
region with interrupts disabled, so as to avoid handling interrupts with a
mismatch between the vCPU tracking field and the loaded page-tables.

As a result of this newly more accurate tracking the mapcache override
functionality can be removed: the dom0 PV builder was the only user of it,
and it's updated here to properly signal which vCPU page-tables are loaded
in the calls to switch_cr3_cr4().

Note the EFI page-tables have the Xen-owned L4 slots copied from the idle
page-tables, so for mapcache purposes the EFI page-tables could use the
idle mapcache if it had one.  Pass the idle vCPU in the switch_cr3_cr4()
call that switches to the runtime EFI page-tables.

There are known issues with the use of mapcache in NMI context.  This patch
does not alter the behaviour.

This is CVE-2026-42488 / XSA-494.

Fixes: fb0ff49fe9f7 ("x86/shadow: defer releasing of PV's top-level shadow reference")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/domain_page.c           | 48 ++++++++++++----------------
 xen/arch/x86/flushtlb.c              |  5 ++-
 xen/arch/x86/include/asm/domain.h    |  1 -
 xen/arch/x86/include/asm/flushtlb.h  |  2 +-
 xen/arch/x86/include/asm/processor.h |  3 ++
 xen/arch/x86/mm.c                    |  4 +--
 xen/arch/x86/pv/dom0_build.c         | 12 +++----
 xen/arch/x86/pv/domain.c             | 13 ++++++--
 xen/arch/x86/smpboot.c               |  1 +
 xen/common/efi/common-stub.c         |  5 ---
 xen/common/efi/runtime.c             | 21 +++++-------
 xen/include/xen/efi.h                |  1 -
 12 files changed, 54 insertions(+), 62 deletions(-)

diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index eac5e3304fb8..72c00194f315 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -18,48 +18,40 @@
 #include <asm/hardirq.h>
 #include <asm/setup.h>
 
-static DEFINE_PER_CPU(struct vcpu *, override);
-
 static inline struct vcpu *mapcache_current_vcpu(void)
 {
-    /* In the common case we use the mapcache of the running VCPU. */
-    struct vcpu *v = this_cpu(override) ?: current;
-
-    /*
-     * When current isn't properly set up yet, this is equivalent to
-     * running in an idle vCPU (callers must check for NULL).
-     */
-    if ( !v )
-        return NULL;
+    struct vcpu *v = this_cpu(pgtable_vcpu);
+    struct vcpu *curr = current;
 
     /*
-     * When using efi runtime page tables, we have the equivalent of the idle
-     * domain's page tables but current may point at another domain's VCPU.
-     * Return NULL as though current is not properly set up yet.
+     * During early boot pgtable_vcpu is not set, callers must handle NULL.
+     * Non-PV domains don't have a mapcache, the directmap covers all physical
+     * address space.
      */
-    if ( efi_rs_using_pgtables() )
+    if ( !v || !is_pv_vcpu(v) )
         return NULL;
 
     /*
-     * If guest_table is NULL, and we are running a paravirtualised guest,
-     * then it means we are running on the idle domain's page table and must
-     * therefore use its mapcache.
+     * If we are in a lazy context-switch state from a PV vCPU do a full switch
+     * to the idle vCPU now, otherwise an incoming FLUSH_VCPU_STATE IPI would
+     * change the page tables under our feet and invalidate any in-use mapcache
+     * entries.
      */
-    if ( unlikely(pagetable_is_null(v->arch.guest_table)) && is_pv_vcpu(v) )
+    if ( unlikely(this_cpu(curr_vcpu) != curr) )
     {
-        /* If we really are idling, perform lazy context switch now. */
-        if ( (v = idle_vcpu[smp_processor_id()]) == current )
-            sync_local_execstate();
+        ASSERT(curr == idle_vcpu[smp_processor_id()]);
+        sync_local_execstate();
         /* We must now be running on the idle page table. */
         ASSERT(cr3_pa(read_cr3()) == __pa(idle_pg_table));
     }
 
-    return v;
-}
-
-void __init mapcache_override_current(struct vcpu *v)
-{
-    this_cpu(override) = v;
+    /*
+     * At this point we can guarantee Xen is not in lazy context switch: either
+     * the code above will have synced the state, or an incoming
+     * FLUSH_VCPU_STATE IPI has done so behind our back.  Use ACCESS_ONCE to
+     * ensure the compiler never returns the locally cached pgtable_vcpu value.
+     */
+    return ACCESS_ONCE(this_cpu(pgtable_vcpu));
 }
 
 #define mapcache_l2_entry(e) ((e) >> PAGETABLE_ORDER)
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 23721bb52c90..5e2ed50ec971 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -104,7 +104,9 @@ static void do_tlb_flush(void)
     local_irq_restore(flags);
 }
 
-void switch_cr3_cr4(unsigned long cr3, unsigned long cr4)
+DEFINE_PER_CPU(struct vcpu *, pgtable_vcpu);
+
+void switch_cr3_cr4(struct vcpu *v, unsigned long cr3, unsigned long cr4)
 {
     unsigned long flags, old_cr4;
     u32 t = 0;
@@ -148,6 +150,7 @@ void switch_cr3_cr4(unsigned long cr3, unsigned long cr4)
     if ( (old_cr4 & X86_CR4_PCIDE) > (cr4 & X86_CR4_PCIDE) )
         cr3 |= X86_CR3_NOFLUSH;
     write_cr3(cr3);
+    this_cpu(pgtable_vcpu) = v;
 
     if ( old_cr4 != cr4 )
         write_cr4(cr4);
diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index 385a6666dafa..e0ce8b4c39c8 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -90,7 +90,6 @@ struct mapcache_domain {
 
 int mapcache_domain_init(struct domain *d);
 int mapcache_vcpu_init(struct vcpu *v);
-void mapcache_override_current(struct vcpu *v);
 
 /* x86/64: toggle guest between kernel and user modes. */
 void toggle_guest_mode(struct vcpu *v);
diff --git a/xen/arch/x86/include/asm/flushtlb.h b/xen/arch/x86/include/asm/flushtlb.h
index 7bcbca2b7f31..345677eb72ae 100644
--- a/xen/arch/x86/include/asm/flushtlb.h
+++ b/xen/arch/x86/include/asm/flushtlb.h
@@ -104,7 +104,7 @@ static inline void invlpg(const void *p)
 }
 
 /* Write pagetable base and implicitly tick the tlbflush clock. */
-void switch_cr3_cr4(unsigned long cr3, unsigned long cr4);
+void switch_cr3_cr4(struct vcpu *v, unsigned long cr3, unsigned long cr4);
 
 /* flush_* flag fields: */
  /*
diff --git a/xen/arch/x86/include/asm/processor.h b/xen/arch/x86/include/asm/processor.h
index c37bd7a17658..8ca6799a8113 100644
--- a/xen/arch/x86/include/asm/processor.h
+++ b/xen/arch/x86/include/asm/processor.h
@@ -329,6 +329,9 @@ DECLARE_PER_CPU(struct tss_page, tss_page);
 
 DECLARE_PER_CPU(root_pgentry_t *, root_pgt);
 
+/* vCPU of the currently loaded page-tables. */
+DECLARE_PER_CPU(struct vcpu *, pgtable_vcpu);
+
 extern void write_ptbase(struct vcpu *v);
 
 /* PAUSE (encoding: REP NOP) is a good thing to insert into busy-wait loops. */
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index a158379e7734..bb4ba0afe2d4 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -535,7 +535,7 @@ void write_ptbase(struct vcpu *v)
         cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
         if ( new_cr4 & X86_CR4_PCIDE )
             cpu_info->pv_cr3 |= get_pcid_bits(v, true);
-        switch_cr3_cr4(v->arch.cr3, new_cr4);
+        switch_cr3_cr4(v, v->arch.cr3, new_cr4);
     }
     else
     {
@@ -543,7 +543,7 @@ void write_ptbase(struct vcpu *v)
         cpu_info->use_pv_cr3 = false;
         cpu_info->xen_cr3 = 0;
         /* switch_cr3_cr4() serializes. */
-        switch_cr3_cr4(v->arch.cr3, new_cr4);
+        switch_cr3_cr4(v, v->arch.cr3, new_cr4);
         cpu_info->pv_cr3 = 0;
     }
 }
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 12d8ba744a9a..ddeb144b0696 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -831,8 +831,7 @@ static int __init dom0_construct(const struct boot_domain *bd)
     update_cr3(v);
 
     /* We run on dom0's page tables for the final part of the build process. */
-    switch_cr3_cr4(cr3_pa(v->arch.cr3), read_cr4());
-    mapcache_override_current(v);
+    switch_cr3_cr4(v, cr3_pa(v->arch.cr3), read_cr4());
 
     /* Copy the OS image and free temporary buffer. */
     elf.dest_base = (void*)vkern_start;
@@ -841,8 +840,7 @@ static int __init dom0_construct(const struct boot_domain *bd)
     rc = elf_load_binary(&elf);
     if ( rc < 0 )
     {
-        mapcache_override_current(NULL);
-        switch_cr3_cr4(current->arch.cr3, read_cr4());
+        switch_cr3_cr4(current, current->arch.cr3, read_cr4());
         printk("Failed to load the kernel binary\n");
         goto out;
     }
@@ -853,8 +851,7 @@ static int __init dom0_construct(const struct boot_domain *bd)
         if ( (parms.virt_hypercall < v_start) ||
              (parms.virt_hypercall >= v_end) )
         {
-            mapcache_override_current(NULL);
-            switch_cr3_cr4(current->arch.cr3, read_cr4());
+            switch_cr3_cr4(current, current->arch.cr3, read_cr4());
             printk("Invalid HYPERCALL_PAGE field in ELF notes.\n");
             return -EINVAL;
         }
@@ -995,8 +992,7 @@ static int __init dom0_construct(const struct boot_domain *bd)
 #endif
 
     /* Return to idle domain's page tables. */
-    mapcache_override_current(NULL);
-    switch_cr3_cr4(current->arch.cr3, read_cr4());
+    switch_cr3_cr4(current, current->arch.cr3, read_cr4());
 
     update_domain_wallclock_time(d);
 
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 5027c32e1a92..0c42ae58aaa1 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -457,6 +457,8 @@ static void _toggle_guest_pt(struct vcpu *v)
     pagetable_t old_shadow;
     unsigned long cr3;
 
+    ASSERT(local_irq_is_enabled());
+
     v->arch.flags ^= TF_kernel_mode;
     guest_update = v->arch.flags & TF_kernel_mode;
     old_shadow = update_cr3(v);
@@ -479,15 +481,22 @@ static void _toggle_guest_pt(struct vcpu *v)
     {
         cr3 &= ~X86_CR3_NOFLUSH;
 
+        local_irq_disable();
         if ( unlikely(mfn_eq(pagetable_get_mfn(old_shadow),
                              maddr_to_mfn(cr3))) )
         {
-            cr3 = idle_vcpu[v->processor]->arch.cr3;
             /* Also suppress runstate/time area updates below. */
             guest_update = false;
+
+            cr3 = idle_vcpu[v->processor]->arch.cr3;
+            this_cpu(pgtable_vcpu) = idle_vcpu[v->processor];
         }
+
+        write_cr3(cr3);
+        local_irq_enable();
     }
-    write_cr3(cr3);
+    else
+        write_cr3(cr3);
 
     if ( !pagetable_is_null(old_shadow) )
         shadow_put_top_level(v->domain, old_shadow);
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index ff05955bae40..d8fd71ffab37 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -1063,6 +1063,7 @@ static int cpu_smpboot_alloc(unsigned int cpu)
 
     info->current_vcpu = idle_vcpu[cpu]; /* set_current() */
     per_cpu(curr_vcpu, cpu) = idle_vcpu[cpu];
+    per_cpu(pgtable_vcpu, cpu) = idle_vcpu[cpu];
 
     gdt = per_cpu(gdt, cpu) ?: alloc_xenheap_pages(0, memflags);
     if ( gdt == NULL )
diff --git a/xen/common/efi/common-stub.c b/xen/common/efi/common-stub.c
index 9dc8aa538cc1..f91ba6f74011 100644
--- a/xen/common/efi/common-stub.c
+++ b/xen/common/efi/common-stub.c
@@ -7,11 +7,6 @@ bool efi_enabled(unsigned int feature)
     return false;
 }
 
-bool efi_rs_using_pgtables(void)
-{
-    return false;
-}
-
 unsigned long efi_get_time(void)
 {
     BUG();
diff --git a/xen/common/efi/runtime.c b/xen/common/efi/runtime.c
index 0f1cc765ec5e..a23fa75e3740 100644
--- a/xen/common/efi/runtime.c
+++ b/xen/common/efi/runtime.c
@@ -50,7 +50,6 @@ const CHAR16 *__read_mostly efi_fw_vendor;
 const EFI_RUNTIME_SERVICES *__read_mostly efi_rs;
 #ifndef CONFIG_ARM /* TODO - disabled until implemented on ARM */
 static DEFINE_SPINLOCK(efi_rs_lock);
-static unsigned int efi_rs_on_cpu = NR_CPUS;
 #endif
 
 UINTN __read_mostly efi_memmap_size;
@@ -93,6 +92,11 @@ struct efi_rs_state efi_rs_enter(void)
     if ( mfn_eq(efi_l4_mfn, INVALID_MFN) )
         return state;
 
+    /*
+     * If in lazy idle context switch state sync now to avoid an incoming
+     * FLUSH_VCPU_STATE IPI changing the loaded page-tables.
+     */
+    sync_local_execstate();
     state.cr3 = read_cr3();
     vcpu_save_fpu(current);
     asm volatile ( "fnclex; fldcw %0" :: "m" (fcw) );
@@ -100,8 +104,6 @@ struct efi_rs_state efi_rs_enter(void)
 
     spin_lock(&efi_rs_lock);
 
-    efi_rs_on_cpu = smp_processor_id();
-
     /* prevent fixup_page_fault() from doing anything */
     irq_enter();
 
@@ -116,7 +118,8 @@ struct efi_rs_state efi_rs_enter(void)
         lgdt(&gdt_desc);
     }
 
-    switch_cr3_cr4(mfn_to_maddr(efi_l4_mfn), read_cr4());
+    switch_cr3_cr4(idle_vcpu[smp_processor_id()], mfn_to_maddr(efi_l4_mfn),
+                   read_cr4());
 
     /*
      * At the time of writing (2022), no UEFI firwmare is CET-IBT compatible.
@@ -144,7 +147,7 @@ void efi_rs_leave(struct efi_rs_state *state)
     if ( state->msr_s_cet )
         wrmsrl(MSR_S_CET, state->msr_s_cet);
 
-    switch_cr3_cr4(state->cr3, read_cr4());
+    switch_cr3_cr4(curr, state->cr3, read_cr4());
     if ( is_pv_vcpu(curr) && !is_idle_vcpu(curr) )
     {
         struct desc_ptr gdt_desc = {
@@ -155,18 +158,10 @@ void efi_rs_leave(struct efi_rs_state *state)
         lgdt(&gdt_desc);
     }
     irq_exit();
-    efi_rs_on_cpu = NR_CPUS;
     spin_unlock(&efi_rs_lock);
     vcpu_restore_fpu(curr);
 }
 
-bool efi_rs_using_pgtables(void)
-{
-    return !mfn_eq(efi_l4_mfn, INVALID_MFN) &&
-           (smp_processor_id() == efi_rs_on_cpu) &&
-           (read_cr3() == mfn_to_maddr(efi_l4_mfn));
-}
-
 unsigned long efi_get_time(void)
 {
     EFI_TIME time;
diff --git a/xen/include/xen/efi.h b/xen/include/xen/efi.h
index 2e36b01e205b..87146172ad75 100644
--- a/xen/include/xen/efi.h
+++ b/xen/include/xen/efi.h
@@ -40,7 +40,6 @@ extern bool efi_secure_boot;
 
 void efi_init_memory(void);
 bool efi_boot_mem_unused(unsigned long *start, unsigned long *end);
-bool efi_rs_using_pgtables(void);
 unsigned long efi_get_time(void);
 void efi_reset_system(bool warm);
 #ifndef COMPAT
-- 
2.53.0
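
The decision order in the patched mapcache_current_vcpu() can be
sketched as a user-space model.  Everything here is an assumption made
for illustration: the *_model types and functions are hypothetical, the
idle vCPU is modelled as PV, and model_sync_execstate() only mimics the
visible effect of completing a lazy context switch (the tracking fields
catch up with current).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a vCPU; only the PV property matters here. */
struct vcpu_model {
    bool is_pv;
};

/* Hypothetical per-CPU state mirroring the three fields the patch discusses. */
struct cpu_model {
    struct vcpu_model *current;      /* vCPU the scheduler sees as running */
    struct vcpu_model *curr_vcpu;    /* vCPU whose state is really loaded */
    struct vcpu_model *pgtable_vcpu; /* vCPU whose page-tables are loaded */
};

/* Model of finishing a lazy context switch: loaded state follows current. */
static void model_sync_execstate(struct cpu_model *cpu)
{
    if (cpu->curr_vcpu != cpu->current) {
        cpu->curr_vcpu = cpu->current;
        cpu->pgtable_vcpu = cpu->current;
    }
}

/* Model of the patched selection of the vCPU whose mapcache is used. */
static struct vcpu_model *model_mapcache_vcpu(struct cpu_model *cpu)
{
    struct vcpu_model *v = cpu->pgtable_vcpu;

    /* Early boot (nothing tracked yet) or non-PV: no mapcache to use. */
    if (v == NULL || !v->is_pv)
        return NULL;

    /* Lazy state: complete the switch before relying on the page-tables. */
    if (cpu->curr_vcpu != cpu->current)
        model_sync_execstate(cpu);

    /* Re-read the tracking field, as the sync may have changed it. */
    return cpu->pgtable_vcpu;
}
```

With current set to the idle vCPU while curr_vcpu and pgtable_vcpu still
name a PV guest, the model completes the switch and returns the idle
vCPU rather than the stale guest value read at function entry.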

From: Roger Pau Monne <roger.pau@citrix.com>
Subject: x86/mm: accurately track which vCPU page-tables are loaded

Neither current nor curr_vcpu per-CPU fields accurately track which
page-tables are loaded.  There are corner cases when dealing with shadow
paging failures that switch to the idle vCPU page-tables without changing
current or curr_vcpu per-CPU fields.

Introduce a new per-CPU field that attempts to track which vCPU page-tables
are loaded.  Update such tracking when cr3 is changed, and do so in a
region with interrupts disabled, so as to avoid handling interrupts with a
mismatch between the vCPU tracking field and the loaded page-tables.

As a result of this newly more accurate tracking the mapcache override
functionality can be removed: the dom0 PV builder was the only user of it,
and it's updated here to properly signal which vCPU page-tables are loaded
in the calls to switch_cr3_cr4().

Note the EFI page-tables have the Xen-owned L4 slots copied from the idle
page-tables, so for mapcache purposes the EFI page-tables could use the
idle mapcache if it had one.  Pass the idle vCPU in the switch_cr3_cr4()
call that switches to the runtime EFI page-tables.

There are known issues with the use of mapcache in NMI context.  This patch
does not alter the behaviour.

This is CVE-2026-42488 / XSA-494.

Fixes: fb0ff49fe9f7 ("x86/shadow: defer releasing of PV's top-level shadow reference")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index eac5e3304fb8..72c00194f315 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -18,48 +18,40 @@
 #include <asm/hardirq.h>
 #include <asm/setup.h>
 
-static DEFINE_PER_CPU(struct vcpu *, override);
-
 static inline struct vcpu *mapcache_current_vcpu(void)
 {
-    /* In the common case we use the mapcache of the running VCPU. */
-    struct vcpu *v = this_cpu(override) ?: current;
-
-    /*
-     * When current isn't properly set up yet, this is equivalent to
-     * running in an idle vCPU (callers must check for NULL).
-     */
-    if ( !v )
-        return NULL;
+    struct vcpu *v = this_cpu(pgtable_vcpu);
+    struct vcpu *curr = current;
 
     /*
-     * When using efi runtime page tables, we have the equivalent of the idle
-     * domain's page tables but current may point at another domain's VCPU.
-     * Return NULL as though current is not properly set up yet.
+     * During early boot pgtable_vcpu is not set, callers must handle NULL.
+     * Non-PV domains don't have a mapcache, the directmap covers all physical
+     * address space.
      */
-    if ( efi_rs_using_pgtables() )
+    if ( !v || !is_pv_vcpu(v) )
         return NULL;
 
     /*
-     * If guest_table is NULL, and we are running a paravirtualised guest,
-     * then it means we are running on the idle domain's page table and must
-     * therefore use its mapcache.
+     * If we are in a lazy context-switch state from a PV vCPU do a full switch
+     * to the idle vCPU now, otherwise an incoming FLUSH_VCPU_STATE IPI would
+     * change the page tables under our feet and invalidate any in-use mapcache
+     * entries.
      */
-    if ( unlikely(pagetable_is_null(v->arch.guest_table)) && is_pv_vcpu(v) )
+    if ( unlikely(this_cpu(curr_vcpu) != curr) )
     {
-        /* If we really are idling, perform lazy context switch now. */
-        if ( (v = idle_vcpu[smp_processor_id()]) == current )
-            sync_local_execstate();
+        ASSERT(curr == idle_vcpu[smp_processor_id()]);
+        sync_local_execstate();
         /* We must now be running on the idle page table. */
         ASSERT(cr3_pa(read_cr3()) == __pa(idle_pg_table));
     }
 
-    return v;
-}
-
-void __init mapcache_override_current(struct vcpu *v)
-{
-    this_cpu(override) = v;
+    /*
+     * At this point we can guarantee Xen is not in lazy context switch: either
+     * the code above will have synced the state, or an incoming
+     * FLUSH_VCPU_STATE IPI has done so behind our back.  Use ACCESS_ONCE to
+     * ensure the compiler never returns the locally cached pgtable_vcpu value.
+     */
+    return ACCESS_ONCE(this_cpu(pgtable_vcpu));
 }
 
 #define mapcache_l2_entry(e) ((e) >> PAGETABLE_ORDER)
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 18748b2bc805..cdefab2f08ec 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -111,7 +111,9 @@ static void do_tlb_flush(void)
     local_irq_restore(flags);
 }
 
-void switch_cr3_cr4(unsigned long cr3, unsigned long cr4)
+DEFINE_PER_CPU(struct vcpu *, pgtable_vcpu);
+
+void switch_cr3_cr4(struct vcpu *v, unsigned long cr3, unsigned long cr4)
 {
     unsigned long flags, old_cr4;
     u32 t = 0;
@@ -155,6 +157,7 @@ void switch_cr3_cr4(unsigned long cr3, unsigned long cr4)
     if ( (old_cr4 & X86_CR4_PCIDE) > (cr4 & X86_CR4_PCIDE) )
         cr3 |= X86_CR3_NOFLUSH;
     write_cr3(cr3);
+    this_cpu(pgtable_vcpu) = v;
 
     if ( old_cr4 != cr4 )
         write_cr4(cr4);
diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index f90a268b0195..588b56b3fdb8 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -76,7 +76,6 @@ struct mapcache_domain {
 
 int mapcache_domain_init(struct domain *);
 int mapcache_vcpu_init(struct vcpu *);
-void mapcache_override_current(struct vcpu *);
 
 /* x86/64: toggle guest between kernel and user modes. */
 void toggle_guest_mode(struct vcpu *);
diff --git a/xen/arch/x86/include/asm/flushtlb.h b/xen/arch/x86/include/asm/flushtlb.h
index a461ee36ffeb..821ffe3e8b16 100644
--- a/xen/arch/x86/include/asm/flushtlb.h
+++ b/xen/arch/x86/include/asm/flushtlb.h
@@ -99,7 +99,7 @@ static inline unsigned long read_cr3(void)
 }
 
 /* Write pagetable base and implicitly tick the tlbflush clock. */
-void switch_cr3_cr4(unsigned long cr3, unsigned long cr4);
+void switch_cr3_cr4(struct vcpu *v, unsigned long cr3, unsigned long cr4);
 
 /* flush_* flag fields: */
  /*
diff --git a/xen/arch/x86/include/asm/processor.h b/xen/arch/x86/include/asm/processor.h
index 07328d44bf4e..4309441944e1 100644
--- a/xen/arch/x86/include/asm/processor.h
+++ b/xen/arch/x86/include/asm/processor.h
@@ -465,6 +465,9 @@ extern idt_entry_t *idt_tables[];
 
 DECLARE_PER_CPU(root_pgentry_t *, root_pgt);
 
+/* vCPU of the currently loaded page-tables. */
+DECLARE_PER_CPU(struct vcpu *, pgtable_vcpu);
+
 extern void write_ptbase(struct vcpu *v);
 
 /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index d31b8d56ffbc..4dbe86017cca 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -546,7 +546,7 @@ void write_ptbase(struct vcpu *v)
         cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
         if ( new_cr4 & X86_CR4_PCIDE )
             cpu_info->pv_cr3 |= get_pcid_bits(v, true);
-        switch_cr3_cr4(v->arch.cr3, new_cr4);
+        switch_cr3_cr4(v, v->arch.cr3, new_cr4);
     }
     else
     {
@@ -554,7 +554,7 @@ void write_ptbase(struct vcpu *v)
         cpu_info->use_pv_cr3 = false;
         cpu_info->xen_cr3 = 0;
         /* switch_cr3_cr4() serializes. */
-        switch_cr3_cr4(v->arch.cr3, new_cr4);
+        switch_cr3_cr4(v, v->arch.cr3, new_cr4);
         cpu_info->pv_cr3 = 0;
     }
 }
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 3e74cf4ea2fe..25267f4edcb2 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -811,8 +811,7 @@ int __init dom0_construct_pv(struct domain *d,
         update_cr3(v);
 
     /* We run on dom0's page tables for the final part of the build process. */
-    switch_cr3_cr4(cr3_pa(v->arch.cr3), read_cr4());
-    mapcache_override_current(v);
+    switch_cr3_cr4(v, cr3_pa(v->arch.cr3), read_cr4());
 
     /* Copy the OS image and free temporary buffer. */
     elf.dest_base = (void*)vkern_start;
@@ -821,8 +820,7 @@ int __init dom0_construct_pv(struct domain *d,
     rc = elf_load_binary(&elf);
     if ( rc < 0 )
     {
-        mapcache_override_current(NULL);
-        switch_cr3_cr4(current->arch.cr3, read_cr4());
+        switch_cr3_cr4(current, current->arch.cr3, read_cr4());
         printk("Failed to load the kernel binary\n");
         goto out;
     }
@@ -833,8 +831,7 @@ int __init dom0_construct_pv(struct domain *d,
         if ( (parms.virt_hypercall < v_start) ||
              (parms.virt_hypercall >= v_end) )
         {
-            mapcache_override_current(NULL);
-            switch_cr3_cr4(current->arch.cr3, read_cr4());
+            switch_cr3_cr4(current, current->arch.cr3, read_cr4());
             printk("Invalid HYPERCALL_PAGE field in ELF notes.\n");
             return -EINVAL;
         }
@@ -975,8 +972,7 @@ int __init dom0_construct_pv(struct domain *d,
 #endif
 
     /* Return to idle domain's page tables. */
-    mapcache_override_current(NULL);
-    switch_cr3_cr4(current->arch.cr3, read_cr4());
+    switch_cr3_cr4(current, current->arch.cr3, read_cr4());
 
     update_domain_wallclock_time(d);
 
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 2a445bb17b99..5cd467d7a694 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -428,6 +428,8 @@ static void _toggle_guest_pt(struct vcpu *v)
     pagetable_t old_shadow;
     unsigned long cr3;
 
+    ASSERT(local_irq_is_enabled());
+
     v->arch.flags ^= TF_kernel_mode;
     guest_update = v->arch.flags & TF_kernel_mode;
     old_shadow = update_cr3(v);
@@ -450,15 +452,22 @@ static void _toggle_guest_pt(struct vcpu *v)
     {
         cr3 &= ~X86_CR3_NOFLUSH;
 
+        local_irq_disable();
         if ( unlikely(mfn_eq(pagetable_get_mfn(old_shadow),
                              maddr_to_mfn(cr3))) )
         {
-            cr3 = idle_vcpu[v->processor]->arch.cr3;
             /* Also suppress runstate/time area updates below. */
             guest_update = false;
+
+            cr3 = idle_vcpu[v->processor]->arch.cr3;
+            this_cpu(pgtable_vcpu) = idle_vcpu[v->processor];
         }
+
+        write_cr3(cr3);
+        local_irq_enable();
     }
-    write_cr3(cr3);
+    else
+        write_cr3(cr3);
 
     if ( !pagetable_is_null(old_shadow) )
         shadow_put_top_level(v->domain, old_shadow);
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 7aa899dac336..a8a4739f5d5d 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -335,6 +335,7 @@ void start_secondary(void *unused)
 
     set_current(idle_vcpu[cpu]);
     this_cpu(curr_vcpu) = idle_vcpu[cpu];
+    this_cpu(pgtable_vcpu) = idle_vcpu[cpu];
     rdmsrl(MSR_EFER, this_cpu(efer));
     init_shadow_spec_ctrl_state();
 
diff --git a/xen/common/efi/common-stub.c b/xen/common/efi/common-stub.c
index 5a91fe28ccca..aaeb916c0f69 100644
--- a/xen/common/efi/common-stub.c
+++ b/xen/common/efi/common-stub.c
@@ -7,11 +7,6 @@ bool efi_enabled(unsigned int feature)
     return false;
 }
 
-bool efi_rs_using_pgtables(void)
-{
-    return false;
-}
-
 unsigned long efi_get_time(void)
 {
     BUG();
diff --git a/xen/common/efi/runtime.c b/xen/common/efi/runtime.c
index 13b0975866e3..a83dfaf3bf4b 100644
--- a/xen/common/efi/runtime.c
+++ b/xen/common/efi/runtime.c
@@ -46,7 +46,6 @@ const CHAR16 *__read_mostly efi_fw_vendor;
 const EFI_RUNTIME_SERVICES *__read_mostly efi_rs;
 #ifndef CONFIG_ARM /* TODO - disabled until implemented on ARM */
 static DEFINE_SPINLOCK(efi_rs_lock);
-static unsigned int efi_rs_on_cpu = NR_CPUS;
 #endif
 
 UINTN __read_mostly efi_memmap_size;
@@ -89,6 +88,11 @@ struct efi_rs_state efi_rs_enter(void)
     if ( mfn_eq(efi_l4_mfn, INVALID_MFN) )
         return state;
 
+    /*
+     * If in lazy idle context switch state sync now to avoid an incoming
+     * FLUSH_VCPU_STATE IPI changing the loaded page-tables.
+     */
+    sync_local_execstate();
     state.cr3 = read_cr3();
     save_fpu_enable();
     asm volatile ( "fnclex; fldcw %0" :: "m" (fcw) );
@@ -96,8 +100,6 @@ struct efi_rs_state efi_rs_enter(void)
 
     spin_lock(&efi_rs_lock);
 
-    efi_rs_on_cpu = smp_processor_id();
-
     /* prevent fixup_page_fault() from doing anything */
     irq_enter();
 
@@ -112,7 +114,8 @@ struct efi_rs_state efi_rs_enter(void)
         lgdt(&gdt_desc);
     }
 
-    switch_cr3_cr4(mfn_to_maddr(efi_l4_mfn), read_cr4());
+    switch_cr3_cr4(idle_vcpu[smp_processor_id()], mfn_to_maddr(efi_l4_mfn),
+                   read_cr4());
 
     /*
      * At the time of writing (2022), no UEFI firwmare is CET-IBT compatible.
@@ -140,7 +143,7 @@ void efi_rs_leave(struct efi_rs_state *state)
     if ( state->msr_s_cet )
         wrmsrl(MSR_S_CET, state->msr_s_cet);
 
-    switch_cr3_cr4(state->cr3, read_cr4());
+    switch_cr3_cr4(curr, state->cr3, read_cr4());
     if ( is_pv_vcpu(curr) && !is_idle_vcpu(curr) )
     {
         struct desc_ptr gdt_desc = {
@@ -151,18 +154,10 @@ void efi_rs_leave(struct efi_rs_state *state)
         lgdt(&gdt_desc);
     }
     irq_exit();
-    efi_rs_on_cpu = NR_CPUS;
     spin_unlock(&efi_rs_lock);
     vcpu_restore_fpu_nonlazy(curr, true);
 }
 
-bool efi_rs_using_pgtables(void)
-{
-    return !mfn_eq(efi_l4_mfn, INVALID_MFN) &&
-           (smp_processor_id() == efi_rs_on_cpu) &&
-           (read_cr3() == mfn_to_maddr(efi_l4_mfn));
-}
-
 unsigned long efi_get_time(void)
 {
     EFI_TIME time;
diff --git a/xen/include/xen/efi.h b/xen/include/xen/efi.h
index 94a7e547f97b..fe2f3b394178 100644
--- a/xen/include/xen/efi.h
+++ b/xen/include/xen/efi.h
@@ -34,7 +34,6 @@ struct compat_pf_efi_runtime_call;
 bool efi_enabled(unsigned int feature);
 void efi_init_memory(void);
 bool efi_boot_mem_unused(unsigned long *start, unsigned long *end);
-bool efi_rs_using_pgtables(void);
 unsigned long efi_get_time(void);
 void efi_halt_system(void);
 void efi_reset_system(bool warm);
From d7ff39d5d30099c733e55d76b0b0499df3faa28d Mon Sep 17 00:00:00 2001
From: Roger Pau Monne <roger.pau@citrix.com>
Date: Mon, 16 Mar 2026 11:03:22 +0100
Subject: [PATCH] x86/mm: accurately track which vCPU page-tables are loaded
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Neither current nor curr_vcpu per-CPU fields accurately track which
page-tables are loaded.  There are corner cases when dealing with shadow
paging failures that switch to the idle vCPU page-tables without changing
current or curr_vcpu per-CPU fields.

Introduce a new per-CPU field that attempts to track which vCPU page-tables
are loaded.  Update such tracking when cr3 is changed, and do so in a
region with interrupts disabled, so as to avoid handling interrupts with a
mismatch between the vCPU tracking field and the loaded page-tables.

As a result of this newly more accurate tracking the mapcache override
functionality can be removed: the dom0 PV builder was the only user of it,
and it's updated here to properly signal which vCPU page-tables are loaded
in the calls to switch_cr3_cr4().

Note the EFI page-tables have the Xen owned L4 slots copied from the idle
page-tables, so for the purposes of the mapcache the EFI page-tables could
use the idle mapcache if it had one.  Pass the idle vCPU in the
switch_cr3_cr4() call that switches to the runtime EFI page-tables.

There are known issues with the use of mapcache in NMI context.  This patch
does not alter the behaviour.

This is CVE-2026-42488 / XSA-494.

Fixes: fb0ff49fe9f7 ("x86/shadow: defer releasing of PV's top-level shadow reference")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/domain_page.c           | 48 ++++++++++++----------------
 xen/arch/x86/flushtlb.c              |  5 ++-
 xen/arch/x86/include/asm/domain.h    |  1 -
 xen/arch/x86/include/asm/flushtlb.h  |  2 +-
 xen/arch/x86/include/asm/processor.h |  3 ++
 xen/arch/x86/mm.c                    |  4 +--
 xen/arch/x86/pv/dom0_build.c         | 12 +++----
 xen/arch/x86/pv/domain.c             | 13 ++++++--
 xen/arch/x86/smpboot.c               |  1 +
 xen/common/efi/common-stub.c         |  5 ---
 xen/common/efi/runtime.c             | 21 +++++-------
 xen/include/xen/efi.h                |  1 -
 12 files changed, 54 insertions(+), 62 deletions(-)

diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index eac5e3304fb8..72c00194f315 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -18,48 +18,40 @@
 #include <asm/hardirq.h>
 #include <asm/setup.h>
 
-static DEFINE_PER_CPU(struct vcpu *, override);
-
 static inline struct vcpu *mapcache_current_vcpu(void)
 {
-    /* In the common case we use the mapcache of the running VCPU. */
-    struct vcpu *v = this_cpu(override) ?: current;
-
-    /*
-     * When current isn't properly set up yet, this is equivalent to
-     * running in an idle vCPU (callers must check for NULL).
-     */
-    if ( !v )
-        return NULL;
+    struct vcpu *v = this_cpu(pgtable_vcpu);
+    struct vcpu *curr = current;
 
     /*
-     * When using efi runtime page tables, we have the equivalent of the idle
-     * domain's page tables but current may point at another domain's VCPU.
-     * Return NULL as though current is not properly set up yet.
+     * During early boot pgtable_vcpu is not set, callers must handle NULL.
+     * Non-PV domains don't have a mapcache, the directmap covers all physical
+     * address space.
      */
-    if ( efi_rs_using_pgtables() )
+    if ( !v || !is_pv_vcpu(v) )
         return NULL;
 
     /*
-     * If guest_table is NULL, and we are running a paravirtualised guest,
-     * then it means we are running on the idle domain's page table and must
-     * therefore use its mapcache.
+     * If we are in a lazy context-switch state from a PV vCPU do a full switch
+     * to the idle vCPU now, otherwise an incoming FLUSH_VCPU_STATE IPI would
+     * change the page tables under our feet and invalidate any in-use mapcache
+     * entries.
      */
-    if ( unlikely(pagetable_is_null(v->arch.guest_table)) && is_pv_vcpu(v) )
+    if ( unlikely(this_cpu(curr_vcpu) != curr) )
     {
-        /* If we really are idling, perform lazy context switch now. */
-        if ( (v = idle_vcpu[smp_processor_id()]) == current )
-            sync_local_execstate();
+        ASSERT(curr == idle_vcpu[smp_processor_id()]);
+        sync_local_execstate();
         /* We must now be running on the idle page table. */
         ASSERT(cr3_pa(read_cr3()) == __pa(idle_pg_table));
     }
 
-    return v;
-}
-
-void __init mapcache_override_current(struct vcpu *v)
-{
-    this_cpu(override) = v;
+    /*
+     * At this point we can guarantee Xen is not in lazy context switch: either
+     * the code above will have synced the state, or an incoming
+     * FLUSH_VCPU_STATE IPI has done so behind our back.  Use ACCESS_ONCE to
+     * ensure the compiler never returns the locally cached pgtable_vcpu value.
+     */
+    return ACCESS_ONCE(this_cpu(pgtable_vcpu));
 }
 
 #define mapcache_l2_entry(e) ((e) >> PAGETABLE_ORDER)
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 18748b2bc805..cdefab2f08ec 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -111,7 +111,9 @@ static void do_tlb_flush(void)
     local_irq_restore(flags);
 }
 
-void switch_cr3_cr4(unsigned long cr3, unsigned long cr4)
+DEFINE_PER_CPU(struct vcpu *, pgtable_vcpu);
+
+void switch_cr3_cr4(struct vcpu *v, unsigned long cr3, unsigned long cr4)
 {
     unsigned long flags, old_cr4;
     u32 t = 0;
@@ -155,6 +157,7 @@ void switch_cr3_cr4(unsigned long cr3, unsigned long cr4)
     if ( (old_cr4 & X86_CR4_PCIDE) > (cr4 & X86_CR4_PCIDE) )
         cr3 |= X86_CR3_NOFLUSH;
     write_cr3(cr3);
+    this_cpu(pgtable_vcpu) = v;
 
     if ( old_cr4 != cr4 )
         write_cr4(cr4);
diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index 0d2d2b6623c0..85fba8b3c576 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -76,7 +76,6 @@ struct mapcache_domain {
 
 int mapcache_domain_init(struct domain *);
 int mapcache_vcpu_init(struct vcpu *);
-void mapcache_override_current(struct vcpu *);
 
 /* x86/64: toggle guest between kernel and user modes. */
 void toggle_guest_mode(struct vcpu *);
diff --git a/xen/arch/x86/include/asm/flushtlb.h b/xen/arch/x86/include/asm/flushtlb.h
index a461ee36ffeb..821ffe3e8b16 100644
--- a/xen/arch/x86/include/asm/flushtlb.h
+++ b/xen/arch/x86/include/asm/flushtlb.h
@@ -99,7 +99,7 @@ static inline unsigned long read_cr3(void)
 }
 
 /* Write pagetable base and implicitly tick the tlbflush clock. */
-void switch_cr3_cr4(unsigned long cr3, unsigned long cr4);
+void switch_cr3_cr4(struct vcpu *v, unsigned long cr3, unsigned long cr4);
 
 /* flush_* flag fields: */
  /*
diff --git a/xen/arch/x86/include/asm/processor.h b/xen/arch/x86/include/asm/processor.h
index c5e5c72341ad..c6e0c858bfb6 100644
--- a/xen/arch/x86/include/asm/processor.h
+++ b/xen/arch/x86/include/asm/processor.h
@@ -372,6 +372,9 @@ extern idt_entry_t *idt_tables[];
 
 DECLARE_PER_CPU(root_pgentry_t *, root_pgt);
 
+/* vCPU of the currently loaded page-tables. */
+DECLARE_PER_CPU(struct vcpu *, pgtable_vcpu);
+
 extern void write_ptbase(struct vcpu *v);
 
 /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 3635f906847c..484b989042cc 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -528,7 +528,7 @@ void write_ptbase(struct vcpu *v)
         cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
         if ( new_cr4 & X86_CR4_PCIDE )
             cpu_info->pv_cr3 |= get_pcid_bits(v, true);
-        switch_cr3_cr4(v->arch.cr3, new_cr4);
+        switch_cr3_cr4(v, v->arch.cr3, new_cr4);
     }
     else
     {
@@ -536,7 +536,7 @@ void write_ptbase(struct vcpu *v)
         cpu_info->use_pv_cr3 = false;
         cpu_info->xen_cr3 = 0;
         /* switch_cr3_cr4() serializes. */
-        switch_cr3_cr4(v->arch.cr3, new_cr4);
+        switch_cr3_cr4(v, v->arch.cr3, new_cr4);
         cpu_info->pv_cr3 = 0;
     }
 }
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 3b28ae45d124..0f9e614ce71f 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -812,8 +812,7 @@ static int __init dom0_construct(struct domain *d,
     update_cr3(v);
 
     /* We run on dom0's page tables for the final part of the build process. */
-    switch_cr3_cr4(cr3_pa(v->arch.cr3), read_cr4());
-    mapcache_override_current(v);
+    switch_cr3_cr4(v, cr3_pa(v->arch.cr3), read_cr4());
 
     /* Copy the OS image and free temporary buffer. */
     elf.dest_base = (void*)vkern_start;
@@ -822,8 +821,7 @@ static int __init dom0_construct(struct domain *d,
     rc = elf_load_binary(&elf);
     if ( rc < 0 )
     {
-        mapcache_override_current(NULL);
-        switch_cr3_cr4(current->arch.cr3, read_cr4());
+        switch_cr3_cr4(current, current->arch.cr3, read_cr4());
         printk("Failed to load the kernel binary\n");
         goto out;
     }
@@ -834,8 +832,7 @@ static int __init dom0_construct(struct domain *d,
         if ( (parms.virt_hypercall < v_start) ||
              (parms.virt_hypercall >= v_end) )
         {
-            mapcache_override_current(NULL);
-            switch_cr3_cr4(current->arch.cr3, read_cr4());
+            switch_cr3_cr4(current, current->arch.cr3, read_cr4());
             printk("Invalid HYPERCALL_PAGE field in ELF notes.\n");
             return -EINVAL;
         }
@@ -976,8 +973,7 @@ static int __init dom0_construct(struct domain *d,
 #endif
 
     /* Return to idle domain's page tables. */
-    mapcache_override_current(NULL);
-    switch_cr3_cr4(current->arch.cr3, read_cr4());
+    switch_cr3_cr4(current, current->arch.cr3, read_cr4());
 
     update_domain_wallclock_time(d);
 
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 2a445bb17b99..5cd467d7a694 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -428,6 +428,8 @@ static void _toggle_guest_pt(struct vcpu *v)
     pagetable_t old_shadow;
     unsigned long cr3;
 
+    ASSERT(local_irq_is_enabled());
+
     v->arch.flags ^= TF_kernel_mode;
     guest_update = v->arch.flags & TF_kernel_mode;
     old_shadow = update_cr3(v);
@@ -450,15 +452,22 @@ static void _toggle_guest_pt(struct vcpu *v)
     {
         cr3 &= ~X86_CR3_NOFLUSH;
 
+        local_irq_disable();
         if ( unlikely(mfn_eq(pagetable_get_mfn(old_shadow),
                              maddr_to_mfn(cr3))) )
         {
-            cr3 = idle_vcpu[v->processor]->arch.cr3;
             /* Also suppress runstate/time area updates below. */
             guest_update = false;
+
+            cr3 = idle_vcpu[v->processor]->arch.cr3;
+            this_cpu(pgtable_vcpu) = idle_vcpu[v->processor];
         }
+
+        write_cr3(cr3);
+        local_irq_enable();
     }
-    write_cr3(cr3);
+    else
+        write_cr3(cr3);
 
     if ( !pagetable_is_null(old_shadow) )
         shadow_put_top_level(v->domain, old_shadow);
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index ec4956e10493..a585d4df5ca6 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -324,6 +324,7 @@ void start_secondary(void *unused)
 
     set_current(idle_vcpu[cpu]);
     this_cpu(curr_vcpu) = idle_vcpu[cpu];
+    this_cpu(pgtable_vcpu) = idle_vcpu[cpu];
     rdmsrl(MSR_EFER, this_cpu(efer));
     init_shadow_spec_ctrl_state();
 
diff --git a/xen/common/efi/common-stub.c b/xen/common/efi/common-stub.c
index 5a91fe28ccca..aaeb916c0f69 100644
--- a/xen/common/efi/common-stub.c
+++ b/xen/common/efi/common-stub.c
@@ -7,11 +7,6 @@ bool efi_enabled(unsigned int feature)
     return false;
 }
 
-bool efi_rs_using_pgtables(void)
-{
-    return false;
-}
-
 unsigned long efi_get_time(void)
 {
     BUG();
diff --git a/xen/common/efi/runtime.c b/xen/common/efi/runtime.c
index 5cb7504c96ad..db21b9a802e6 100644
--- a/xen/common/efi/runtime.c
+++ b/xen/common/efi/runtime.c
@@ -46,7 +46,6 @@ const CHAR16 *__read_mostly efi_fw_vendor;
 const EFI_RUNTIME_SERVICES *__read_mostly efi_rs;
 #ifndef CONFIG_ARM /* TODO - disabled until implemented on ARM */
 static DEFINE_SPINLOCK(efi_rs_lock);
-static unsigned int efi_rs_on_cpu = NR_CPUS;
 #endif
 
 UINTN __read_mostly efi_memmap_size;
@@ -89,6 +88,11 @@ struct efi_rs_state efi_rs_enter(void)
     if ( mfn_eq(efi_l4_mfn, INVALID_MFN) )
         return state;
 
+    /*
+     * If in lazy idle context switch state sync now to avoid an incoming
+     * FLUSH_VCPU_STATE IPI changing the loaded page-tables.
+     */
+    sync_local_execstate();
     state.cr3 = read_cr3();
     save_fpu_enable();
     asm volatile ( "fnclex; fldcw %0" :: "m" (fcw) );
@@ -96,8 +100,6 @@ struct efi_rs_state efi_rs_enter(void)
 
     spin_lock(&efi_rs_lock);
 
-    efi_rs_on_cpu = smp_processor_id();
-
     /* prevent fixup_page_fault() from doing anything */
     irq_enter();
 
@@ -112,7 +114,8 @@ struct efi_rs_state efi_rs_enter(void)
         lgdt(&gdt_desc);
     }
 
-    switch_cr3_cr4(mfn_to_maddr(efi_l4_mfn), read_cr4());
+    switch_cr3_cr4(idle_vcpu[smp_processor_id()], mfn_to_maddr(efi_l4_mfn),
+                   read_cr4());
 
     /*
      * At the time of writing (2022), no UEFI firwmare is CET-IBT compatible.
@@ -140,7 +143,7 @@ void efi_rs_leave(struct efi_rs_state *state)
     if ( state->msr_s_cet )
         wrmsrl(MSR_S_CET, state->msr_s_cet);
 
-    switch_cr3_cr4(state->cr3, read_cr4());
+    switch_cr3_cr4(curr, state->cr3, read_cr4());
     if ( is_pv_vcpu(curr) && !is_idle_vcpu(curr) )
     {
         struct desc_ptr gdt_desc = {
@@ -151,18 +154,10 @@ void efi_rs_leave(struct efi_rs_state *state)
         lgdt(&gdt_desc);
     }
     irq_exit();
-    efi_rs_on_cpu = NR_CPUS;
     spin_unlock(&efi_rs_lock);
     vcpu_restore_fpu_nonlazy(curr, true);
 }
 
-bool efi_rs_using_pgtables(void)
-{
-    return !mfn_eq(efi_l4_mfn, INVALID_MFN) &&
-           (smp_processor_id() == efi_rs_on_cpu) &&
-           (read_cr3() == mfn_to_maddr(efi_l4_mfn));
-}
-
 unsigned long efi_get_time(void)
 {
     EFI_TIME time;
diff --git a/xen/include/xen/efi.h b/xen/include/xen/efi.h
index 942d2e9491e9..383b382fb057 100644
--- a/xen/include/xen/efi.h
+++ b/xen/include/xen/efi.h
@@ -34,7 +34,6 @@ struct compat_pf_efi_runtime_call;
 bool efi_enabled(unsigned int feature);
 void efi_init_memory(void);
 bool efi_boot_mem_unused(unsigned long *start, unsigned long *end);
-bool efi_rs_using_pgtables(void);
 unsigned long efi_get_time(void);
 void efi_halt_system(void);
 void efi_reset_system(bool warm);
-- 
2.53.0
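
Editorial illustration (not part of any patch): the core idea of the change
above is that the record of whose page-tables are loaded is updated together
with the cr3 write, with interrupts disabled, so no interrupt handler can
observe the two out of sync.  The following is a minimal user-space sketch of
that invariant.  All names prefixed with model_ and the struct layouts are
hypothetical stand-ins, not Xen's real types or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a vCPU; only what the model needs. */
struct vcpu {
    int id;
    bool is_pv;
};

/* Hypothetical model of one physical CPU's relevant state. */
struct model_cpu {
    unsigned long cr3;          /* models the loaded page-table base */
    struct vcpu *pgtable_vcpu;  /* models the per-CPU tracking field */
    bool irqs_enabled;          /* models the interrupt flag */
};

/*
 * Models the three-argument page-table switch: the tracking field changes in
 * the same interrupts-off window as the page-table base.
 */
static void model_switch_cr3(struct model_cpu *cpu, struct vcpu *v,
                             unsigned long cr3)
{
    bool saved = cpu->irqs_enabled;

    cpu->irqs_enabled = false;      /* models local_irq_save() */
    assert(!cpu->irqs_enabled);     /* both updates happen with IRQs off */
    cpu->cr3 = cr3;
    cpu->pgtable_vcpu = v;
    cpu->irqs_enabled = saved;      /* models local_irq_restore() */
}

/*
 * Models the mapcache owner lookup: NULL when nothing is tracked yet (early
 * boot) or when the tracked vCPU is not PV.
 */
static struct vcpu *model_mapcache_vcpu(const struct model_cpu *cpu)
{
    struct vcpu *v = cpu->pgtable_vcpu;

    return (v && v->is_pv) ? v : NULL;
}
```

The sketch deliberately leaves out PCID handling, the cr4 write and the TLB
flush clock, all of which the real switch_cr3_cr4() also deals with.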

From 161ddee42ea7dc1a36015c156d0736696f91020d Mon Sep 17 00:00:00 2001
From: Roger Pau Monne <roger.pau@citrix.com>
Date: Mon, 16 Mar 2026 11:03:22 +0100
Subject: [PATCH] x86/mm: accurately track which vCPU page-tables are loaded
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Neither current nor curr_vcpu per-CPU fields accurately track which
page-tables are loaded.  There are corner cases when dealing with shadow
paging failures that switch to the idle vCPU page-tables without changing
current or curr_vcpu per-CPU fields.

Introduce a new per-CPU field that attempts to track which vCPU page-tables
are loaded.  Update such tracking when cr3 is changed, and do so in a
region with interrupts disabled, so as to avoid handling interrupts with a
mismatch between the vCPU tracking field and the loaded page-tables.

As a result of this newly more accurate tracking the mapcache override
functionality can be removed: the dom0 PV builder was the only user of it,
and it's updated here to properly signal which vCPU page-tables are loaded
in the calls to switch_cr3_cr4().

Note the EFI page-tables have the Xen owned L4 slots copied from the idle
page-tables, so for the purposes of the mapcache the EFI page-tables could
use the idle mapcache if it had one.  Pass the idle vCPU in the
switch_cr3_cr4() call that switches to the runtime EFI page-tables.

There are known issues with the use of mapcache in NMI context.  This patch
does not alter the behaviour.

This is CVE-2026-42488 / XSA-494.

Fixes: fb0ff49fe9f7 ("x86/shadow: defer releasing of PV's top-level shadow reference")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/domain_page.c           | 48 ++++++++++++----------------
 xen/arch/x86/flushtlb.c              |  5 ++-
 xen/arch/x86/include/asm/domain.h    |  1 -
 xen/arch/x86/include/asm/flushtlb.h  |  2 +-
 xen/arch/x86/include/asm/processor.h |  3 ++
 xen/arch/x86/mm.c                    |  4 +--
 xen/arch/x86/pv/dom0_build.c         | 12 +++----
 xen/arch/x86/pv/domain.c             | 13 ++++++--
 xen/arch/x86/smpboot.c               |  1 +
 xen/common/efi/common-stub.c         |  5 ---
 xen/common/efi/runtime.c             | 21 +++++-------
 xen/include/xen/efi.h                |  1 -
 12 files changed, 54 insertions(+), 62 deletions(-)

diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index eac5e3304fb8..72c00194f315 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -18,48 +18,40 @@
 #include <asm/hardirq.h>
 #include <asm/setup.h>
 
-static DEFINE_PER_CPU(struct vcpu *, override);
-
 static inline struct vcpu *mapcache_current_vcpu(void)
 {
-    /* In the common case we use the mapcache of the running VCPU. */
-    struct vcpu *v = this_cpu(override) ?: current;
-
-    /*
-     * When current isn't properly set up yet, this is equivalent to
-     * running in an idle vCPU (callers must check for NULL).
-     */
-    if ( !v )
-        return NULL;
+    struct vcpu *v = this_cpu(pgtable_vcpu);
+    struct vcpu *curr = current;
 
     /*
-     * When using efi runtime page tables, we have the equivalent of the idle
-     * domain's page tables but current may point at another domain's VCPU.
-     * Return NULL as though current is not properly set up yet.
+     * During early boot pgtable_vcpu is not set, callers must handle NULL.
+     * Non-PV domains don't have a mapcache, the directmap covers all physical
+     * address space.
      */
-    if ( efi_rs_using_pgtables() )
+    if ( !v || !is_pv_vcpu(v) )
         return NULL;
 
     /*
-     * If guest_table is NULL, and we are running a paravirtualised guest,
-     * then it means we are running on the idle domain's page table and must
-     * therefore use its mapcache.
+     * If we are in a lazy context-switch state from a PV vCPU do a full switch
+     * to the idle vCPU now, otherwise an incoming FLUSH_VCPU_STATE IPI would
+     * change the page tables under our feet and invalidate any in-use mapcache
+     * entries.
      */
-    if ( unlikely(pagetable_is_null(v->arch.guest_table)) && is_pv_vcpu(v) )
+    if ( unlikely(this_cpu(curr_vcpu) != curr) )
     {
-        /* If we really are idling, perform lazy context switch now. */
-        if ( (v = idle_vcpu[smp_processor_id()]) == current )
-            sync_local_execstate();
+        ASSERT(curr == idle_vcpu[smp_processor_id()]);
+        sync_local_execstate();
         /* We must now be running on the idle page table. */
         ASSERT(cr3_pa(read_cr3()) == __pa(idle_pg_table));
     }
 
-    return v;
-}
-
-void __init mapcache_override_current(struct vcpu *v)
-{
-    this_cpu(override) = v;
+    /*
+     * At this point we can guarantee Xen is not in lazy context switch: either
+     * the code above will have synced the state, or an incoming
+     * FLUSH_VCPU_STATE IPI has done so behind our back.  Use ACCESS_ONCE to
+     * ensure the compiler never returns the locally cached pgtable_vcpu value.
+     */
+    return ACCESS_ONCE(this_cpu(pgtable_vcpu));
 }
 
 #define mapcache_l2_entry(e) ((e) >> PAGETABLE_ORDER)
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 65be0474a8ea..16f1fab5c5e6 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -111,7 +111,9 @@ static void do_tlb_flush(void)
     local_irq_restore(flags);
 }
 
-void switch_cr3_cr4(unsigned long cr3, unsigned long cr4)
+DEFINE_PER_CPU(struct vcpu *, pgtable_vcpu);
+
+void switch_cr3_cr4(struct vcpu *v, unsigned long cr3, unsigned long cr4)
 {
     unsigned long flags, old_cr4;
     u32 t = 0;
@@ -155,6 +157,7 @@ void switch_cr3_cr4(unsigned long cr3, unsigned long cr4)
     if ( (old_cr4 & X86_CR4_PCIDE) > (cr4 & X86_CR4_PCIDE) )
         cr3 |= X86_CR3_NOFLUSH;
     write_cr3(cr3);
+    this_cpu(pgtable_vcpu) = v;
 
     if ( old_cr4 != cr4 )
         write_cr4(cr4);
diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index b79d6badd71c..f0370bc7bb12 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -75,7 +75,6 @@ struct mapcache_domain {
 
 int mapcache_domain_init(struct domain *d);
 int mapcache_vcpu_init(struct vcpu *v);
-void mapcache_override_current(struct vcpu *v);
 
 /* x86/64: toggle guest between kernel and user modes. */
 void toggle_guest_mode(struct vcpu *v);
diff --git a/xen/arch/x86/include/asm/flushtlb.h b/xen/arch/x86/include/asm/flushtlb.h
index bb0ad58db49b..75e291d93bf6 100644
--- a/xen/arch/x86/include/asm/flushtlb.h
+++ b/xen/arch/x86/include/asm/flushtlb.h
@@ -99,7 +99,7 @@ static inline unsigned long read_cr3(void)
 }
 
 /* Write pagetable base and implicitly tick the tlbflush clock. */
-void switch_cr3_cr4(unsigned long cr3, unsigned long cr4);
+void switch_cr3_cr4(struct vcpu *v, unsigned long cr3, unsigned long cr4);
 
 /* flush_* flag fields: */
  /*
diff --git a/xen/arch/x86/include/asm/processor.h b/xen/arch/x86/include/asm/processor.h
index 98734f4d3ff3..4b52d68a6f87 100644
--- a/xen/arch/x86/include/asm/processor.h
+++ b/xen/arch/x86/include/asm/processor.h
@@ -375,6 +375,9 @@ extern idt_entry_t *idt_tables[];
 
 DECLARE_PER_CPU(root_pgentry_t *, root_pgt);
 
+/* vCPU of the currently loaded page-tables. */
+DECLARE_PER_CPU(struct vcpu *, pgtable_vcpu);
+
 extern void write_ptbase(struct vcpu *v);
 
 /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 3430b13dcd2c..23496407f2b9 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -542,7 +542,7 @@ void write_ptbase(struct vcpu *v)
         cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
         if ( new_cr4 & X86_CR4_PCIDE )
             cpu_info->pv_cr3 |= get_pcid_bits(v, true);
-        switch_cr3_cr4(v->arch.cr3, new_cr4);
+        switch_cr3_cr4(v, v->arch.cr3, new_cr4);
     }
     else
     {
@@ -550,7 +550,7 @@ void write_ptbase(struct vcpu *v)
         cpu_info->use_pv_cr3 = false;
         cpu_info->xen_cr3 = 0;
         /* switch_cr3_cr4() serializes. */
-        switch_cr3_cr4(v->arch.cr3, new_cr4);
+        switch_cr3_cr4(v, v->arch.cr3, new_cr4);
         cpu_info->pv_cr3 = 0;
     }
 }
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 5bc59b48a5a8..7ce82f199b3f 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -836,8 +836,7 @@ static int __init dom0_construct(struct boot_info *bi, struct domain *d)
     update_cr3(v);
 
     /* We run on dom0's page tables for the final part of the build process. */
-    switch_cr3_cr4(cr3_pa(v->arch.cr3), read_cr4());
-    mapcache_override_current(v);
+    switch_cr3_cr4(v, cr3_pa(v->arch.cr3), read_cr4());
 
     /* Copy the OS image and free temporary buffer. */
     elf.dest_base = (void*)vkern_start;
@@ -846,8 +845,7 @@ static int __init dom0_construct(struct boot_info *bi, struct domain *d)
     rc = elf_load_binary(&elf);
     if ( rc < 0 )
     {
-        mapcache_override_current(NULL);
-        switch_cr3_cr4(current->arch.cr3, read_cr4());
+        switch_cr3_cr4(current, current->arch.cr3, read_cr4());
         printk("Failed to load the kernel binary\n");
         goto out;
     }
@@ -858,8 +856,7 @@ static int __init dom0_construct(struct boot_info *bi, struct domain *d)
         if ( (parms.virt_hypercall < v_start) ||
              (parms.virt_hypercall >= v_end) )
         {
-            mapcache_override_current(NULL);
-            switch_cr3_cr4(current->arch.cr3, read_cr4());
+            switch_cr3_cr4(current, current->arch.cr3, read_cr4());
             printk("Invalid HYPERCALL_PAGE field in ELF notes.\n");
             return -EINVAL;
         }
@@ -1000,8 +997,7 @@ static int __init dom0_construct(struct boot_info *bi, struct domain *d)
 #endif
 
     /* Return to idle domain's page tables. */
-    mapcache_override_current(NULL);
-    switch_cr3_cr4(current->arch.cr3, read_cr4());
+    switch_cr3_cr4(current, current->arch.cr3, read_cr4());
 
     update_domain_wallclock_time(d);
 
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 745c1dbb217a..0f45ccafc268 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -449,6 +449,8 @@ static void _toggle_guest_pt(struct vcpu *v)
     pagetable_t old_shadow;
     unsigned long cr3;
 
+    ASSERT(local_irq_is_enabled());
+
     v->arch.flags ^= TF_kernel_mode;
     guest_update = v->arch.flags & TF_kernel_mode;
     old_shadow = update_cr3(v);
@@ -471,15 +473,22 @@ static void _toggle_guest_pt(struct vcpu *v)
     {
         cr3 &= ~X86_CR3_NOFLUSH;
 
+        local_irq_disable();
         if ( unlikely(mfn_eq(pagetable_get_mfn(old_shadow),
                              maddr_to_mfn(cr3))) )
         {
-            cr3 = idle_vcpu[v->processor]->arch.cr3;
             /* Also suppress runstate/time area updates below. */
             guest_update = false;
+
+            cr3 = idle_vcpu[v->processor]->arch.cr3;
+            this_cpu(pgtable_vcpu) = idle_vcpu[v->processor];
         }
+
+        write_cr3(cr3);
+        local_irq_enable();
     }
-    write_cr3(cr3);
+    else
+        write_cr3(cr3);
 
     if ( !pagetable_is_null(old_shadow) )
         shadow_put_top_level(v->domain, old_shadow);
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 8742e3056141..fc0761150ffe 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -330,6 +330,7 @@ void asmlinkage start_secondary(void *unused)
 
     set_current(idle_vcpu[cpu]);
     this_cpu(curr_vcpu) = idle_vcpu[cpu];
+    this_cpu(pgtable_vcpu) = idle_vcpu[cpu];
     rdmsrl(MSR_EFER, this_cpu(efer));
     init_shadow_spec_ctrl_state();
 
diff --git a/xen/common/efi/common-stub.c b/xen/common/efi/common-stub.c
index 77f138a6c574..7b12005bea3f 100644
--- a/xen/common/efi/common-stub.c
+++ b/xen/common/efi/common-stub.c
@@ -7,11 +7,6 @@ bool efi_enabled(unsigned int feature)
     return false;
 }
 
-bool efi_rs_using_pgtables(void)
-{
-    return false;
-}
-
 unsigned long efi_get_time(void)
 {
     BUG();
diff --git a/xen/common/efi/runtime.c b/xen/common/efi/runtime.c
index 7e1fce291d92..af7f96fb7dd0 100644
--- a/xen/common/efi/runtime.c
+++ b/xen/common/efi/runtime.c
@@ -47,7 +47,6 @@ const CHAR16 *__read_mostly efi_fw_vendor;
 const EFI_RUNTIME_SERVICES *__read_mostly efi_rs;
 #ifndef CONFIG_ARM /* TODO - disabled until implemented on ARM */
 static DEFINE_SPINLOCK(efi_rs_lock);
-static unsigned int efi_rs_on_cpu = NR_CPUS;
 #endif
 
 UINTN __read_mostly efi_memmap_size;
@@ -90,6 +89,11 @@ struct efi_rs_state efi_rs_enter(void)
     if ( mfn_eq(efi_l4_mfn, INVALID_MFN) )
         return state;
 
+    /*
+     * If in lazy idle context switch state sync now to avoid an incoming
+     * FLUSH_VCPU_STATE IPI changing the loaded page-tables.
+     */
+    sync_local_execstate();
     state.cr3 = read_cr3();
     save_fpu_enable();
     asm volatile ( "fnclex; fldcw %0" :: "m" (fcw) );
@@ -97,8 +101,6 @@ struct efi_rs_state efi_rs_enter(void)
 
     spin_lock(&efi_rs_lock);
 
-    efi_rs_on_cpu = smp_processor_id();
-
     /* prevent fixup_page_fault() from doing anything */
     irq_enter();
 
@@ -113,7 +115,8 @@ struct efi_rs_state efi_rs_enter(void)
         lgdt(&gdt_desc);
     }
 
-    switch_cr3_cr4(mfn_to_maddr(efi_l4_mfn), read_cr4());
+    switch_cr3_cr4(idle_vcpu[smp_processor_id()], mfn_to_maddr(efi_l4_mfn),
+                   read_cr4());
 
     /*
      * At the time of writing (2022), no UEFI firwmare is CET-IBT compatible.
@@ -141,7 +144,7 @@ void efi_rs_leave(struct efi_rs_state *state)
     if ( state->msr_s_cet )
         wrmsrl(MSR_S_CET, state->msr_s_cet);
 
-    switch_cr3_cr4(state->cr3, read_cr4());
+    switch_cr3_cr4(curr, state->cr3, read_cr4());
     if ( is_pv_vcpu(curr) && !is_idle_vcpu(curr) )
     {
         struct desc_ptr gdt_desc = {
@@ -152,18 +155,10 @@ void efi_rs_leave(struct efi_rs_state *state)
         lgdt(&gdt_desc);
     }
     irq_exit();
-    efi_rs_on_cpu = NR_CPUS;
     spin_unlock(&efi_rs_lock);
     vcpu_restore_fpu_nonlazy(curr, true);
 }
 
-bool efi_rs_using_pgtables(void)
-{
-    return !mfn_eq(efi_l4_mfn, INVALID_MFN) &&
-           (smp_processor_id() == efi_rs_on_cpu) &&
-           (read_cr3() == mfn_to_maddr(efi_l4_mfn));
-}
-
 unsigned long efi_get_time(void)
 {
     EFI_TIME time;
diff --git a/xen/include/xen/efi.h b/xen/include/xen/efi.h
index 160804e29444..356be1705a54 100644
--- a/xen/include/xen/efi.h
+++ b/xen/include/xen/efi.h
@@ -42,7 +42,6 @@ static inline bool efi_enabled(unsigned int feature)
 
 void efi_init_memory(void);
 bool efi_boot_mem_unused(unsigned long *start, unsigned long *end);
-bool efi_rs_using_pgtables(void);
 unsigned long efi_get_time(void);
 void efi_halt_system(void);
 void efi_reset_system(bool warm);
-- 
2.53.0
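
Editorial illustration (not part of any patch): the reworked
mapcache_current_vcpu() first forces any pending lazy context switch to
complete and only then reads the tracking field again, so the value returned
matches the page-tables that stay loaded.  The sketch below models that
ordering in plain C.  Every name prefixed with model_ and the struct layouts
are hypothetical; the model assumes that syncing a lazy state loads the idle
vCPU's page-tables, and that the idle vCPU counts as PV for the purposes of
the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a vCPU. */
struct vcpu {
    int id;
    bool is_pv;
};

/* Hypothetical per-CPU state relevant to lazy context switching. */
struct model_cpu {
    struct vcpu *current;       /* vCPU the scheduler considers running */
    struct vcpu *curr_vcpu;     /* vCPU whose register state is loaded */
    struct vcpu *pgtable_vcpu;  /* vCPU whose page-tables are loaded */
    struct vcpu *idle;          /* this CPU's idle vCPU */
};

/* Models completing a deferred switch to the idle vCPU. */
static void model_sync_local_execstate(struct model_cpu *cpu)
{
    if ( cpu->curr_vcpu != cpu->current )
    {
        cpu->curr_vcpu = cpu->current;
        cpu->pgtable_vcpu = cpu->current;
    }
}

/* Models the lookup order: early exits, sync if lazy, then re-read. */
static struct vcpu *model_mapcache_current_vcpu(struct model_cpu *cpu)
{
    struct vcpu *v = cpu->pgtable_vcpu;

    if ( !v || !v->is_pv )
        return NULL;

    if ( cpu->curr_vcpu != cpu->current )
    {
        /* A lazy state is only expected while running the idle vCPU. */
        assert(cpu->current == cpu->idle);
        model_sync_local_execstate(cpu);
    }

    /* Re-read: the sync above may have changed the tracked vCPU. */
    return cpu->pgtable_vcpu;
}
```

Returning the value read before the sync would hand back the previous guest
vCPU even though its page-tables are no longer loaded; the patch uses
ACCESS_ONCE() on the final read to rule that out.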

From 579016a359741044c9076bf0884e1dbab00ab080 Mon Sep 17 00:00:00 2001
From: Roger Pau Monne <roger.pau@citrix.com>
Date: Mon, 16 Mar 2026 11:03:22 +0100
Subject: [PATCH] x86/mm: accurately track which vCPU page-tables are loaded
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Neither current nor curr_vcpu per-CPU fields accurately track which
page-tables are loaded.  There are corner cases when dealing with shadow
paging failures that switch to the idle vCPU page-tables without changing
current or curr_vcpu per-CPU fields.

Introduce a new per-CPU field that attempts to track which vCPU page-tables
are loaded.  Update such tracking when cr3 is changed, and do so in a
region with interrupts disabled, so as to avoid handling interrupts with a
mismatch between the vCPU tracking field and the loaded page-tables.

As a result of this newly more accurate tracking the mapcache override
functionality can be removed: the dom0 PV builder was the only user of it,
and it's updated here to properly signal which vCPU page-tables are loaded
in the calls to switch_cr3_cr4().

Note the EFI page-tables have the Xen owned L4 slots copied from the idle
page-tables, so for the purposes of the mapcache the EFI page-tables could
use the idle mapcache if it had one.  Pass the idle vCPU in the
switch_cr3_cr4() call that switches to the runtime EFI page-tables.

There are known issues with the use of mapcache in NMI context.  This patch
does not alter the behaviour.

This is CVE-2026-42488 / XSA-494.

Fixes: fb0ff49fe9f7 ("x86/shadow: defer releasing of PV's top-level shadow reference")
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/domain_page.c           | 48 ++++++++++++----------------
 xen/arch/x86/flushtlb.c              |  5 ++-
 xen/arch/x86/include/asm/domain.h    |  1 -
 xen/arch/x86/include/asm/flushtlb.h  |  2 +-
 xen/arch/x86/include/asm/processor.h |  3 ++
 xen/arch/x86/mm.c                    |  4 +--
 xen/arch/x86/pv/dom0_build.c         | 12 +++----
 xen/arch/x86/pv/domain.c             | 13 ++++++--
 xen/arch/x86/smpboot.c               |  1 +
 xen/common/efi/common-stub.c         |  5 ---
 xen/common/efi/runtime.c             | 21 +++++-------
 xen/include/xen/efi.h                |  1 -
 12 files changed, 54 insertions(+), 62 deletions(-)
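
Illustration only, not part of the patch: the tracking described in the
commit message above can be modelled in a few lines of userspace C.  This is
a minimal sketch under stated assumptions.  struct toy_cpu, toy_switch_cr3(),
the two toy_mapcache_owner_*() helpers and toy_demo() are hypothetical names
invented here, the "old" lookup is reduced to "use current", and the CR3
values are arbitrary.  It only shows why keying the mapcache on a field that
is updated together with CR3 avoids the mismatch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for illustration; these are NOT the real Xen definitions. */
struct vcpu {
    int id;
    unsigned long cr3;          /* page-table root this vCPU runs on */
};

struct toy_cpu {
    struct vcpu *current;       /* models 'current' */
    struct vcpu *pgtable_vcpu;  /* models this_cpu(pgtable_vcpu) */
    unsigned long cr3;          /* models the hardware CR3 register */
    bool irqs_enabled;          /* models the local interrupt flag */
};

/*
 * Hypothetical model of a CR3 switch that takes the owning vCPU: CR3 and the
 * tracking field change together while interrupts are (notionally) off.
 */
static void toy_switch_cr3(struct toy_cpu *cpu, struct vcpu *v,
                           unsigned long cr3)
{
    bool saved = cpu->irqs_enabled;

    cpu->irqs_enabled = false;      /* stands in for local_irq_save() */
    assert(!cpu->irqs_enabled);     /* no interrupt can observe a mismatch */
    cpu->cr3 = cr3;                 /* stands in for write_cr3() */
    cpu->pgtable_vcpu = v;
    cpu->irqs_enabled = saved;      /* stands in for local_irq_restore() */
}

/* Simplified "old" lookup: trust current, whatever CR3 holds. */
static struct vcpu *toy_mapcache_owner_old(const struct toy_cpu *cpu)
{
    return cpu->current;
}

/* Simplified "new" lookup: use the vCPU recorded at the last CR3 switch. */
static struct vcpu *toy_mapcache_owner_new(const struct toy_cpu *cpu)
{
    return cpu->pgtable_vcpu;       /* may be NULL before the first switch */
}

static bool toy_owner_matches_cr3(const struct toy_cpu *cpu,
                                  const struct vcpu *owner)
{
    return owner != NULL && owner->cr3 == cpu->cr3;
}

/*
 * Model an error path that falls back to the idle page-tables without
 * changing current.  Bit 0 of the result is set if the old lookup disagrees
 * with the loaded page-tables, bit 1 if the new lookup agrees with them.
 */
int toy_demo(void)
{
    struct vcpu idle = { .id = 0, .cr3 = 0x1000 };
    struct vcpu guest = { .id = 1, .cr3 = 0x2000 };
    struct toy_cpu cpu = {
        .current = &guest,
        .pgtable_vcpu = &guest,
        .cr3 = guest.cr3,
        .irqs_enabled = true,
    };
    int result = 0;

    toy_switch_cr3(&cpu, &idle, idle.cr3);   /* current stays &guest */

    if ( !toy_owner_matches_cr3(&cpu, toy_mapcache_owner_old(&cpu)) )
        result |= 1;
    if ( toy_owner_matches_cr3(&cpu, toy_mapcache_owner_new(&cpu)) )
        result |= 2;

    return result;
}
```

In this model toy_demo() sets both bits: the lookup keyed on current names
the guest vCPU while the idle page-tables are loaded, whereas the lookup
keyed on the tracking field names the idle vCPU.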

diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c
index eac5e3304fb8..72c00194f315 100644
--- a/xen/arch/x86/domain_page.c
+++ b/xen/arch/x86/domain_page.c
@@ -18,48 +18,40 @@
 #include <asm/hardirq.h>
 #include <asm/setup.h>
 
-static DEFINE_PER_CPU(struct vcpu *, override);
-
 static inline struct vcpu *mapcache_current_vcpu(void)
 {
-    /* In the common case we use the mapcache of the running VCPU. */
-    struct vcpu *v = this_cpu(override) ?: current;
-
-    /*
-     * When current isn't properly set up yet, this is equivalent to
-     * running in an idle vCPU (callers must check for NULL).
-     */
-    if ( !v )
-        return NULL;
+    struct vcpu *v = this_cpu(pgtable_vcpu);
+    struct vcpu *curr = current;
 
     /*
-     * When using efi runtime page tables, we have the equivalent of the idle
-     * domain's page tables but current may point at another domain's VCPU.
-     * Return NULL as though current is not properly set up yet.
+     * During early boot pgtable_vcpu is not set, callers must handle NULL.
+     * Non-PV domains don't have a mapcache, the directmap covers all physical
+     * address space.
      */
-    if ( efi_rs_using_pgtables() )
+    if ( !v || !is_pv_vcpu(v) )
         return NULL;
 
     /*
-     * If guest_table is NULL, and we are running a paravirtualised guest,
-     * then it means we are running on the idle domain's page table and must
-     * therefore use its mapcache.
+     * If we are in a lazy context-switch state from a PV vCPU do a full switch
+     * to the idle vCPU now, otherwise an incoming FLUSH_VCPU_STATE IPI would
+     * change the page tables under our feet and invalidate any in-use mapcache
+     * entries.
      */
-    if ( unlikely(pagetable_is_null(v->arch.guest_table)) && is_pv_vcpu(v) )
+    if ( unlikely(this_cpu(curr_vcpu) != curr) )
     {
-        /* If we really are idling, perform lazy context switch now. */
-        if ( (v = idle_vcpu[smp_processor_id()]) == current )
-            sync_local_execstate();
+        ASSERT(curr == idle_vcpu[smp_processor_id()]);
+        sync_local_execstate();
         /* We must now be running on the idle page table. */
         ASSERT(cr3_pa(read_cr3()) == __pa(idle_pg_table));
     }
 
-    return v;
-}
-
-void __init mapcache_override_current(struct vcpu *v)
-{
-    this_cpu(override) = v;
+    /*
+     * At this point we can guarantee Xen is not in lazy context switch: either
+     * the code above will have synced the state, or an incoming
+     * FLUSH_VCPU_STATE IPI has done so behind our back.  Use ACCESS_ONCE to
+     * ensure the compiler never returns the locally cached pgtable_vcpu value.
+     */
+    return ACCESS_ONCE(this_cpu(pgtable_vcpu));
 }
 
 #define mapcache_l2_entry(e) ((e) >> PAGETABLE_ORDER)
diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c
index 09e676c151fa..928bca66b433 100644
--- a/xen/arch/x86/flushtlb.c
+++ b/xen/arch/x86/flushtlb.c
@@ -111,7 +111,9 @@ static void do_tlb_flush(void)
     local_irq_restore(flags);
 }
 
-void switch_cr3_cr4(unsigned long cr3, unsigned long cr4)
+DEFINE_PER_CPU(struct vcpu *, pgtable_vcpu);
+
+void switch_cr3_cr4(struct vcpu *v, unsigned long cr3, unsigned long cr4)
 {
     unsigned long flags, old_cr4;
     u32 t = 0;
@@ -155,6 +157,7 @@ void switch_cr3_cr4(unsigned long cr3, unsigned long cr4)
     if ( (old_cr4 & X86_CR4_PCIDE) > (cr4 & X86_CR4_PCIDE) )
         cr3 |= X86_CR3_NOFLUSH;
     write_cr3(cr3);
+    this_cpu(pgtable_vcpu) = v;
 
     if ( old_cr4 != cr4 )
         write_cr4(cr4);
diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index 828f42c3e448..10d2b9fe2546 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -75,7 +75,6 @@ struct mapcache_domain {
 
 int mapcache_domain_init(struct domain *d);
 int mapcache_vcpu_init(struct vcpu *v);
-void mapcache_override_current(struct vcpu *v);
 
 /* x86/64: toggle guest between kernel and user modes. */
 void toggle_guest_mode(struct vcpu *v);
diff --git a/xen/arch/x86/include/asm/flushtlb.h b/xen/arch/x86/include/asm/flushtlb.h
index 7bcbca2b7f31..345677eb72ae 100644
--- a/xen/arch/x86/include/asm/flushtlb.h
+++ b/xen/arch/x86/include/asm/flushtlb.h
@@ -104,7 +104,7 @@ static inline void invlpg(const void *p)
 }
 
 /* Write pagetable base and implicitly tick the tlbflush clock. */
-void switch_cr3_cr4(unsigned long cr3, unsigned long cr4);
+void switch_cr3_cr4(struct vcpu *v, unsigned long cr3, unsigned long cr4);
 
 /* flush_* flag fields: */
  /*
diff --git a/xen/arch/x86/include/asm/processor.h b/xen/arch/x86/include/asm/processor.h
index 2e087c625770..d2cacdfedb74 100644
--- a/xen/arch/x86/include/asm/processor.h
+++ b/xen/arch/x86/include/asm/processor.h
@@ -328,6 +328,9 @@ DECLARE_PER_CPU(struct tss_page, tss_page);
 
 DECLARE_PER_CPU(root_pgentry_t *, root_pgt);
 
+/* vCPU of the currently loaded page-tables. */
+DECLARE_PER_CPU(struct vcpu *, pgtable_vcpu);
+
 extern void write_ptbase(struct vcpu *v);
 
 /* PAUSE (encoding: REP NOP) is a good thing to insert into busy-wait loops. */
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 2b23bf2e7a75..d02c9862d387 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -535,7 +535,7 @@ void write_ptbase(struct vcpu *v)
         cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
         if ( new_cr4 & X86_CR4_PCIDE )
             cpu_info->pv_cr3 |= get_pcid_bits(v, true);
-        switch_cr3_cr4(v->arch.cr3, new_cr4);
+        switch_cr3_cr4(v, v->arch.cr3, new_cr4);
     }
     else
     {
@@ -543,7 +543,7 @@ void write_ptbase(struct vcpu *v)
         cpu_info->use_pv_cr3 = false;
         cpu_info->xen_cr3 = 0;
         /* switch_cr3_cr4() serializes. */
-        switch_cr3_cr4(v->arch.cr3, new_cr4);
+        switch_cr3_cr4(v, v->arch.cr3, new_cr4);
         cpu_info->pv_cr3 = 0;
     }
 }
diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index 37729091dfaa..42bc530c0f0d 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -828,8 +828,7 @@ static int __init dom0_construct(const struct boot_domain *bd)
     update_cr3(v);
 
     /* We run on dom0's page tables for the final part of the build process. */
-    switch_cr3_cr4(cr3_pa(v->arch.cr3), read_cr4());
-    mapcache_override_current(v);
+    switch_cr3_cr4(v, cr3_pa(v->arch.cr3), read_cr4());
 
     /* Copy the OS image and free temporary buffer. */
     elf.dest_base = (void*)vkern_start;
@@ -838,8 +837,7 @@ static int __init dom0_construct(const struct boot_domain *bd)
     rc = elf_load_binary(&elf);
     if ( rc < 0 )
     {
-        mapcache_override_current(NULL);
-        switch_cr3_cr4(current->arch.cr3, read_cr4());
+        switch_cr3_cr4(current, current->arch.cr3, read_cr4());
         printk("Failed to load the kernel binary\n");
         goto out;
     }
@@ -850,8 +848,7 @@ static int __init dom0_construct(const struct boot_domain *bd)
         if ( (parms.virt_hypercall < v_start) ||
              (parms.virt_hypercall >= v_end) )
         {
-            mapcache_override_current(NULL);
-            switch_cr3_cr4(current->arch.cr3, read_cr4());
+            switch_cr3_cr4(current, current->arch.cr3, read_cr4());
             printk("Invalid HYPERCALL_PAGE field in ELF notes.\n");
             return -EINVAL;
         }
@@ -992,8 +989,7 @@ static int __init dom0_construct(const struct boot_domain *bd)
 #endif
 
     /* Return to idle domain's page tables. */
-    mapcache_override_current(NULL);
-    switch_cr3_cr4(current->arch.cr3, read_cr4());
+    switch_cr3_cr4(current, current->arch.cr3, read_cr4());
 
     update_domain_wallclock_time(d);
 
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index ef4f442e7332..d9e52f5f88f3 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -451,6 +451,8 @@ static void _toggle_guest_pt(struct vcpu *v)
     pagetable_t old_shadow;
     unsigned long cr3;
 
+    ASSERT(local_irq_is_enabled());
+
     v->arch.flags ^= TF_kernel_mode;
     guest_update = v->arch.flags & TF_kernel_mode;
     old_shadow = update_cr3(v);
@@ -473,15 +475,22 @@ static void _toggle_guest_pt(struct vcpu *v)
     {
         cr3 &= ~X86_CR3_NOFLUSH;
 
+        local_irq_disable();
         if ( unlikely(mfn_eq(pagetable_get_mfn(old_shadow),
                              maddr_to_mfn(cr3))) )
         {
-            cr3 = idle_vcpu[v->processor]->arch.cr3;
             /* Also suppress runstate/time area updates below. */
             guest_update = false;
+
+            cr3 = idle_vcpu[v->processor]->arch.cr3;
+            this_cpu(pgtable_vcpu) = idle_vcpu[v->processor];
         }
+
+        write_cr3(cr3);
+        local_irq_enable();
     }
-    write_cr3(cr3);
+    else
+        write_cr3(cr3);
 
     if ( !pagetable_is_null(old_shadow) )
         shadow_put_top_level(v->domain, old_shadow);
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 27628800a821..b37feab3bef4 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -1063,6 +1063,7 @@ static int cpu_smpboot_alloc(unsigned int cpu)
 
     info->current_vcpu = idle_vcpu[cpu]; /* set_current() */
     per_cpu(curr_vcpu, cpu) = idle_vcpu[cpu];
+    per_cpu(pgtable_vcpu, cpu) = idle_vcpu[cpu];
 
     gdt = per_cpu(gdt, cpu) ?: alloc_xenheap_pages(0, memflags);
     if ( gdt == NULL )
diff --git a/xen/common/efi/common-stub.c b/xen/common/efi/common-stub.c
index 77f138a6c574..7b12005bea3f 100644
--- a/xen/common/efi/common-stub.c
+++ b/xen/common/efi/common-stub.c
@@ -7,11 +7,6 @@ bool efi_enabled(unsigned int feature)
     return false;
 }
 
-bool efi_rs_using_pgtables(void)
-{
-    return false;
-}
-
 unsigned long efi_get_time(void)
 {
     BUG();
diff --git a/xen/common/efi/runtime.c b/xen/common/efi/runtime.c
index 30d649ca5c1b..feb09acf754c 100644
--- a/xen/common/efi/runtime.c
+++ b/xen/common/efi/runtime.c
@@ -49,7 +49,6 @@ const CHAR16 *__read_mostly efi_fw_vendor;
 const EFI_RUNTIME_SERVICES *__read_mostly efi_rs;
 #ifndef CONFIG_ARM /* TODO - disabled until implemented on ARM */
 static DEFINE_SPINLOCK(efi_rs_lock);
-static unsigned int efi_rs_on_cpu = NR_CPUS;
 #endif
 
 UINTN __read_mostly efi_memmap_size;
@@ -92,6 +91,11 @@ struct efi_rs_state efi_rs_enter(void)
     if ( mfn_eq(efi_l4_mfn, INVALID_MFN) )
         return state;
 
+    /*
+     * If in lazy idle context switch state sync now to avoid an incoming
+     * FLUSH_VCPU_STATE IPI changing the loaded page-tables.
+     */
+    sync_local_execstate();
     state.cr3 = read_cr3();
     save_fpu_enable();
     asm volatile ( "fnclex; fldcw %0" :: "m" (fcw) );
@@ -99,8 +103,6 @@ struct efi_rs_state efi_rs_enter(void)
 
     spin_lock(&efi_rs_lock);
 
-    efi_rs_on_cpu = smp_processor_id();
-
     /* prevent fixup_page_fault() from doing anything */
     irq_enter();
 
@@ -115,7 +117,8 @@ struct efi_rs_state efi_rs_enter(void)
         lgdt(&gdt_desc);
     }
 
-    switch_cr3_cr4(mfn_to_maddr(efi_l4_mfn), read_cr4());
+    switch_cr3_cr4(idle_vcpu[smp_processor_id()], mfn_to_maddr(efi_l4_mfn),
+                   read_cr4());
 
     /*
      * At the time of writing (2022), no UEFI firwmare is CET-IBT compatible.
@@ -143,7 +146,7 @@ void efi_rs_leave(struct efi_rs_state *state)
     if ( state->msr_s_cet )
         wrmsrl(MSR_S_CET, state->msr_s_cet);
 
-    switch_cr3_cr4(state->cr3, read_cr4());
+    switch_cr3_cr4(curr, state->cr3, read_cr4());
     if ( is_pv_vcpu(curr) && !is_idle_vcpu(curr) )
     {
         struct desc_ptr gdt_desc = {
@@ -154,18 +157,10 @@ void efi_rs_leave(struct efi_rs_state *state)
         lgdt(&gdt_desc);
     }
     irq_exit();
-    efi_rs_on_cpu = NR_CPUS;
     spin_unlock(&efi_rs_lock);
     vcpu_restore_fpu_nonlazy(curr, true);
 }
 
-bool efi_rs_using_pgtables(void)
-{
-    return !mfn_eq(efi_l4_mfn, INVALID_MFN) &&
-           (smp_processor_id() == efi_rs_on_cpu) &&
-           (read_cr3() == mfn_to_maddr(efi_l4_mfn));
-}
-
 unsigned long efi_get_time(void)
 {
     EFI_TIME time;
diff --git a/xen/include/xen/efi.h b/xen/include/xen/efi.h
index 723cb8085270..9953197ee553 100644
--- a/xen/include/xen/efi.h
+++ b/xen/include/xen/efi.h
@@ -40,7 +40,6 @@ extern bool efi_secure_boot;
 
 void efi_init_memory(void);
 bool efi_boot_mem_unused(unsigned long *start, unsigned long *end);
-bool efi_rs_using_pgtables(void);
 unsigned long efi_get_time(void);
 void efi_halt_system(void);
 void efi_reset_system(bool warm);
-- 
2.53.0