Hello,

The aim of this series is to introduce the functionality required to create linear mappings visible to a single pCPU. Doing so requires having a per-CPU root page-table (L4), and hence requires changes to the HVM monitor tables and shadowing the guest-selected L4 on PV guests.

As follow-ups (and partially to ensure the per-CPU mappings work fine) the CPU stacks are switched to use per-CPU mappings, so that remote stack contents are not by default mapped on all page-tables (note: for this to be true the directmap entries for the stack pages would also need to be removed).

Patches before patch 12 are either small fixes or preparatory non-functional changes to accommodate the rest of the series. Patch 12 introduces a new 'asi' spec-ctrl option that's used to enable Address Space Isolation. Patches 13-15 and 20 introduce logic to use a per-CPU L4 on HVM and PV guests. Patches 16-18 add support for creating per-CPU mappings to the existing page-table management functions, map_pages_to_xen() and related functions. Patch 19 introduces helpers for creating per-CPU mappings using a fixmap interface. Finally, patches 21-22 add support for mapping the CPU stack in a per-CPU fixmap region, and for zeroing the stacks on guest context switch.

I've been testing the patches quite a lot using XenRT, and so far they don't seem to cause regressions (either with spec-ctrl=asi or without it), but XenRT no longer tests shadow paging or 32bit PV guests.

This proposal is also missing an interface similar to map_domain_page() for creating per-CPU mappings that don't use a fixmap entry. I thought however that the current content was enough for a first posting, and that I would like to get feedback on it before building further functionality on top.

Note that none of the logic introduced in the series removes entries from the directmap, so even when creating the per-CPU mappings the underlying physical addresses remain fully accessible through their linear directmap entries.

I also haven't done any benchmarking. The series doesn't seem to cripple performance to the point that XenRT jobs would time out before finishing; that's the only objective reference I can provide at the moment.

It's likely to still have some rough edges, handle with care.

Thanks, Roger.
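To make the central idea more concrete before the individual patches, here is a simplified sketch; it is not the actual series code, the names are taken from the patches where they exist and the function itself is illustrative only. Each pCPU owns its own root page-table (L4), and on context switch the per-domain slot of that L4 is re-pointed at the incoming domain's per-domain L3, so per-domain (and later per-CPU) mappings are only reachable from the pCPU currently running that domain.

    /* Illustrative sketch only; mirrors setup_perdomain_slot() from patch 11. */
    static DEFINE_PER_CPU(root_pgentry_t *, root_pgt);

    static void ctxt_switch_sketch(const struct vcpu *v)
    {
        root_pgentry_t *root_pgt = this_cpu(root_pgt);

        /* Re-point slot 260 (per-domain mappings) at the new domain's L3. */
        l4e_write(&root_pgt[root_table_offset(PERDOMAIN_VIRT_START)],
                  l4e_from_page(v->domain->arch.perdomain_l3_pg,
                                __PAGE_HYPERVISOR_RW));
    }

All of this is keyed on the new spec-ctrl option from patch 12, expected to be used on the Xen command line as either spec-ctrl=asi (all guest types), spec-ctrl=asi=pv or spec-ctrl=asi=hvm.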
Roger Pau Monne (22):
  x86/mm: drop l{1,2,3,4}e_write_atomic()
  x86/mm: rename l{1,2,3,4}e_read_atomic()
  x86/dom0: only disable SMAP for the PV dom0 build
  x86/mm: ensure L4 idle_pg_table is not modified past boot
  x86/mm: make virt_to_xen_l1e() static
  x86/mm: introduce a local domain variable to write_ptbase()
  x86/spec-ctrl: initialize per-domain XPTI in spec_ctrl_init_domain()
  x86/mm: avoid passing a domain parameter to L4 init function
  x86/pv: untie issuing FLUSH_ROOT_PGTBL from XPTI
  x86/mm: move FLUSH_ROOT_PGTBL handling before TLB flush
  x86/mm: split setup of the per-domain slot on context switch
  x86/spec-ctrl: introduce Address Space Isolation command line option
  x86/hvm: use a per-pCPU monitor table in HAP mode
  x86/hvm: use a per-pCPU monitor table in shadow mode
  x86/idle: allow using a per-pCPU L4
  x86/mm: introduce a per-CPU L3 table for the per-domain slot
  x86/mm: introduce support to populate a per-CPU page-table region
  x86/mm: allow modifying per-CPU entries of remote page-tables
  x86/mm: introduce a per-CPU fixmap area
  x86/pv: allow using a unique per-pCPU root page table (L4)
  x86/mm: switch to a per-CPU mapped stack when using ASI
  x86/mm: zero stack on stack switch or reset

 docs/misc/xen-command-line.pandoc      |  15 +-
 xen/arch/x86/boot/x86_64.S             |  11 +
 xen/arch/x86/domain.c                  |  75 +++-
 xen/arch/x86/domain_page.c             |   2 +-
 xen/arch/x86/flushtlb.c                |  18 +-
 xen/arch/x86/hvm/hvm.c                 |  67 ++++
 xen/arch/x86/hvm/svm/svm.c             |   5 +
 xen/arch/x86/hvm/vmx/vmcs.c            |   1 +
 xen/arch/x86/hvm/vmx/vmx.c             |   4 +
 xen/arch/x86/include/asm/config.h      |   4 +
 xen/arch/x86/include/asm/current.h     |  38 +-
 xen/arch/x86/include/asm/domain.h      |   7 +
 xen/arch/x86/include/asm/fixmap.h      |  50 +++
 xen/arch/x86/include/asm/flushtlb.h    |   3 +-
 xen/arch/x86/include/asm/hap.h         |   1 -
 xen/arch/x86/include/asm/hvm/hvm.h     |   8 +
 xen/arch/x86/include/asm/hvm/vcpu.h    |   6 +-
 xen/arch/x86/include/asm/mm.h          |  34 +-
 xen/arch/x86/include/asm/page.h        |  37 +-
 xen/arch/x86/include/asm/paging.h      |  18 +
 xen/arch/x86/include/asm/pv/mm.h       |   8 +
 xen/arch/x86/include/asm/setup.h       |   1 +
 xen/arch/x86/include/asm/smp.h         |  12 +
 xen/arch/x86/include/asm/spec_ctrl.h   |   2 +
 xen/arch/x86/include/asm/x86_64/page.h |   4 -
 xen/arch/x86/mm.c                      | 484 ++++++++++++++++++++-----
 xen/arch/x86/mm/hap/hap.c              |  74 ----
 xen/arch/x86/mm/paging.c               |   4 +-
 xen/arch/x86/mm/shadow/common.c        |  42 +--
 xen/arch/x86/mm/shadow/hvm.c           |  64 ++--
 xen/arch/x86/mm/shadow/multi.c         |  73 ++--
 xen/arch/x86/mm/shadow/private.h       |   4 +-
 xen/arch/x86/pv/dom0_build.c           |  16 +-
 xen/arch/x86/pv/domain.c               |  28 +-
 xen/arch/x86/pv/mm.c                   |  52 +++
 xen/arch/x86/setup.c                   |  55 +--
 xen/arch/x86/smp.c                     |  29 ++
 xen/arch/x86/smpboot.c                 |  78 +++-
 xen/arch/x86/spec_ctrl.c               |  78 +++-
 xen/arch/x86/traps.c                   |  14 +-
 xen/common/efi/runtime.c               |  12 +
 xen/common/smp.c                       |  10 +
 xen/include/xen/smp.h                  |   5 +
 43 files changed, 1198 insertions(+), 355 deletions(-)

--
2.45.2
The l{1,2,3,4}e_write_atomic() and non _atomic suffixed helpers share the same implementation, so it seems pointless and possibly confusing to have both. Remove the l{1,2,3,4}e_write_atomic() helpers and switch its users to l{1,2,3,4}e_write(), as that's also atomic. While there, also remove pte_write{,_atomic}() and just use write_atomic() in the wrappers. No functional change intended. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/include/asm/page.h | 21 +++----------- xen/arch/x86/include/asm/x86_64/page.h | 2 -- xen/arch/x86/mm.c | 39 +++++++++++--------------- 3 files changed, 20 insertions(+), 42 deletions(-) diff --git a/xen/arch/x86/include/asm/page.h b/xen/arch/x86/include/asm/page.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/page.h +++ b/xen/arch/x86/include/asm/page.h @@ -XXX,XX +XXX,XX @@ l4e_from_intpte(pte_read_atomic(&l4e_get_intpte(*(l4ep)))) /* Write a pte atomically to memory. */ -#define l1e_write_atomic(l1ep, l1e) \ - pte_write_atomic(&l1e_get_intpte(*(l1ep)), l1e_get_intpte(l1e)) -#define l2e_write_atomic(l2ep, l2e) \ - pte_write_atomic(&l2e_get_intpte(*(l2ep)), l2e_get_intpte(l2e)) -#define l3e_write_atomic(l3ep, l3e) \ - pte_write_atomic(&l3e_get_intpte(*(l3ep)), l3e_get_intpte(l3e)) -#define l4e_write_atomic(l4ep, l4e) \ - pte_write_atomic(&l4e_get_intpte(*(l4ep)), l4e_get_intpte(l4e)) - -/* - * Write a pte safely but non-atomically to memory. - * The PTE may become temporarily not-present during the update. - */ #define l1e_write(l1ep, l1e) \ - pte_write(&l1e_get_intpte(*(l1ep)), l1e_get_intpte(l1e)) + write_atomic(&l1e_get_intpte(*(l1ep)), l1e_get_intpte(l1e)) #define l2e_write(l2ep, l2e) \ - pte_write(&l2e_get_intpte(*(l2ep)), l2e_get_intpte(l2e)) + write_atomic(&l2e_get_intpte(*(l2ep)), l2e_get_intpte(l2e)) #define l3e_write(l3ep, l3e) \ - pte_write(&l3e_get_intpte(*(l3ep)), l3e_get_intpte(l3e)) + write_atomic(&l3e_get_intpte(*(l3ep)), l3e_get_intpte(l3e)) #define l4e_write(l4ep, l4e) \ - pte_write(&l4e_get_intpte(*(l4ep)), l4e_get_intpte(l4e)) + write_atomic(&l4e_get_intpte(*(l4ep)), l4e_get_intpte(l4e)) /* Get direct integer representation of a pte's contents (intpte_t). */ #define l1e_get_intpte(x) ((x).l1) diff --git a/xen/arch/x86/include/asm/x86_64/page.h b/xen/arch/x86/include/asm/x86_64/page.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/x86_64/page.h +++ b/xen/arch/x86/include/asm/x86_64/page.h @@ -XXX,XX +XXX,XX @@ typedef l4_pgentry_t root_pgentry_t; #endif /* !__ASSEMBLY__ */ #define pte_read_atomic(ptep) read_atomic(ptep) -#define pte_write_atomic(ptep, pte) write_atomic(ptep, pte) -#define pte_write(ptep, pte) write_atomic(ptep, pte) /* Given a virtual address, get an entry offset into a linear page table. */ #define l1_linear_offset(_a) (((_a) & VADDR_MASK) >> L1_PAGETABLE_SHIFT) diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( !(flags & (_PAGE_PAT | MAP_SMALL_PAGES)) ) { /* 1GB-page mapping.
*/ - l3e_write_atomic(pl3e, l3e_from_mfn(mfn, l1f_to_lNf(flags))); + l3e_write(pl3e, l3e_from_mfn(mfn, l1f_to_lNf(flags))); if ( (l3e_get_flags(ol3e) & _PAGE_PRESENT) ) { @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( if ( (l3e_get_flags(*pl3e) & _PAGE_PRESENT) && (l3e_get_flags(*pl3e) & _PAGE_PSE) ) { - l3e_write_atomic(pl3e, - l3e_from_mfn(l2mfn, __PAGE_HYPERVISOR)); + l3e_write(pl3e, l3e_from_mfn(l2mfn, __PAGE_HYPERVISOR)); l2mfn = INVALID_MFN; } if ( locking ) @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( { /* Super-page mapping. */ ol2e = *pl2e; - l2e_write_atomic(pl2e, l2e_from_mfn(mfn, l1f_to_lNf(flags))); + l2e_write(pl2e, l2e_from_mfn(mfn, l1f_to_lNf(flags))); if ( (l2e_get_flags(ol2e) & _PAGE_PRESENT) ) { @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( if ( (l2e_get_flags(*pl2e) & _PAGE_PRESENT) && (l2e_get_flags(*pl2e) & _PAGE_PSE) ) { - l2e_write_atomic(pl2e, l2e_from_mfn(l1mfn, - __PAGE_HYPERVISOR)); + l2e_write(pl2e, l2e_from_mfn(l1mfn, __PAGE_HYPERVISOR)); l1mfn = INVALID_MFN; } if ( locking ) @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( if ( !pl1e ) pl1e = map_l1t_from_l2e(*pl2e) + l1_table_offset(virt); ol1e = *pl1e; - l1e_write_atomic(pl1e, l1e_from_mfn(mfn, flags)); + l1e_write(pl1e, l1e_from_mfn(mfn, flags)); UNMAP_DOMAIN_PAGE(pl1e); if ( (l1e_get_flags(ol1e) & _PAGE_PRESENT) ) { @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( UNMAP_DOMAIN_PAGE(l1t); if ( i == L1_PAGETABLE_ENTRIES ) { - l2e_write_atomic(pl2e, l2e_from_pfn(base_mfn, - l1f_to_lNf(flags))); + l2e_write(pl2e, l2e_from_pfn(base_mfn, l1f_to_lNf(flags))); if ( locking ) spin_unlock(&map_pgdir_lock); flush_area(virt - PAGE_SIZE, @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( UNMAP_DOMAIN_PAGE(l2t); if ( i == L2_PAGETABLE_ENTRIES ) { - l3e_write_atomic(pl3e, l3e_from_pfn(base_mfn, - l1f_to_lNf(flags))); + l3e_write(pl3e, l3e_from_pfn(base_mfn, l1f_to_lNf(flags))); if ( locking ) spin_unlock(&map_pgdir_lock); flush_area(virt - PAGE_SIZE, @@ -XXX,XX +XXX,XX @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) : l3e_from_pfn(l3e_get_pfn(*pl3e), (l3e_get_flags(*pl3e) & ~FLAGS_MASK) | nf); - l3e_write_atomic(pl3e, nl3e); + l3e_write(pl3e, nl3e); v += 1UL << L3_PAGETABLE_SHIFT; continue; } @@ -XXX,XX +XXX,XX @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) if ( (l3e_get_flags(*pl3e) & _PAGE_PRESENT) && (l3e_get_flags(*pl3e) & _PAGE_PSE) ) { - l3e_write_atomic(pl3e, - l3e_from_mfn(l2mfn, __PAGE_HYPERVISOR)); + l3e_write(pl3e, l3e_from_mfn(l2mfn, __PAGE_HYPERVISOR)); l2mfn = INVALID_MFN; } if ( locking ) @@ -XXX,XX +XXX,XX @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) : l2e_from_pfn(l2e_get_pfn(*pl2e), (l2e_get_flags(*pl2e) & ~FLAGS_MASK) | nf); - l2e_write_atomic(pl2e, nl2e); + l2e_write(pl2e, nl2e); v += 1UL << L2_PAGETABLE_SHIFT; } else @@ -XXX,XX +XXX,XX @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) if ( (l2e_get_flags(*pl2e) & _PAGE_PRESENT) && (l2e_get_flags(*pl2e) & _PAGE_PSE) ) { - l2e_write_atomic(pl2e, l2e_from_mfn(l1mfn, - __PAGE_HYPERVISOR)); + l2e_write(pl2e, l2e_from_mfn(l1mfn, __PAGE_HYPERVISOR)); l1mfn = INVALID_MFN; } if ( locking ) @@ -XXX,XX +XXX,XX @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) : l1e_from_pfn(l1e_get_pfn(*pl1e), (l1e_get_flags(*pl1e) & ~FLAGS_MASK) | nf); - l1e_write_atomic(pl1e, nl1e); + l1e_write(pl1e, nl1e); UNMAP_DOMAIN_PAGE(pl1e); v += PAGE_SIZE; @@ -XXX,XX +XXX,XX @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) if ( i 
== L1_PAGETABLE_ENTRIES ) { /* Empty: zap the L2E and free the L1 page. */ - l2e_write_atomic(pl2e, l2e_empty()); + l2e_write(pl2e, l2e_empty()); if ( locking ) spin_unlock(&map_pgdir_lock); flush_area(NULL, FLUSH_TLB_GLOBAL); /* flush before free */ @@ -XXX,XX +XXX,XX @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) if ( i == L2_PAGETABLE_ENTRIES ) { /* Empty: zap the L3E and free the L2 page. */ - l3e_write_atomic(pl3e, l3e_empty()); + l3e_write(pl3e, l3e_empty()); if ( locking ) spin_unlock(&map_pgdir_lock); flush_area(NULL, FLUSH_TLB_GLOBAL); /* flush before free */ @@ -XXX,XX +XXX,XX @@ void init_or_livepatch modify_xen_mappings_lite( { ASSERT(IS_ALIGNED(v, 1UL << L2_PAGETABLE_SHIFT)); - l2e_write_atomic(pl2e, l2e_from_intpte((l2e.l2 & ~fm) | flags)); + l2e_write(pl2e, l2e_from_intpte((l2e.l2 & ~fm) | flags)); v += 1UL << L2_PAGETABLE_SHIFT; continue; @@ -XXX,XX +XXX,XX @@ void init_or_livepatch modify_xen_mappings_lite( ASSERT(l1f & _PAGE_PRESENT); - l1e_write_atomic(pl1e, - l1e_from_intpte((l1e.l1 & ~fm) | flags)); + l1e_write(pl1e, l1e_from_intpte((l1e.l1 & ~fm) | flags)); v += 1UL << L1_PAGETABLE_SHIFT; -- 2.45.2
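For the preceding patch, a minimal sketch of the end state may help: the macro is the one from the hunk above, and the caller line is the shape used in map_pages_to_xen(); the surrounding context is omitted.

    /* Only one write helper family remains, and it is a full-width atomic store: */
    #define l1e_write(l1ep, l1e) \
        write_atomic(&l1e_get_intpte(*(l1ep)), l1e_get_intpte(l1e))

    /* Caller side; previously either l1e_write() or l1e_write_atomic(): */
    l1e_write(pl1e, l1e_from_mfn(mfn, flags));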
There's no l{1,2,3,4}e_read() implementation, so drop the _atomic suffix from the read helpers. This allows unifying the naming with the write helpers, which are also atomic but don't have the suffix already: l{1,2,3,4}e_write(). No functional change intended. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/include/asm/page.h | 16 ++++++++-------- xen/arch/x86/include/asm/x86_64/page.h | 2 -- xen/arch/x86/mm.c | 12 ++++++------ xen/arch/x86/traps.c | 8 ++++---- 4 files changed, 18 insertions(+), 20 deletions(-) diff --git a/xen/arch/x86/include/asm/page.h b/xen/arch/x86/include/asm/page.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/page.h +++ b/xen/arch/x86/include/asm/page.h @@ -XXX,XX +XXX,XX @@ #include <asm/x86_64/page.h> /* Read a pte atomically from memory. */ -#define l1e_read_atomic(l1ep) \ - l1e_from_intpte(pte_read_atomic(&l1e_get_intpte(*(l1ep)))) -#define l2e_read_atomic(l2ep) \ - l2e_from_intpte(pte_read_atomic(&l2e_get_intpte(*(l2ep)))) -#define l3e_read_atomic(l3ep) \ - l3e_from_intpte(pte_read_atomic(&l3e_get_intpte(*(l3ep)))) -#define l4e_read_atomic(l4ep) \ - l4e_from_intpte(pte_read_atomic(&l4e_get_intpte(*(l4ep)))) +#define l1e_read(l1ep) \ + l1e_from_intpte(read_atomic(&l1e_get_intpte(*(l1ep)))) +#define l2e_read(l2ep) \ + l2e_from_intpte(read_atomic(&l2e_get_intpte(*(l2ep)))) +#define l3e_read(l3ep) \ + l3e_from_intpte(read_atomic(&l3e_get_intpte(*(l3ep)))) +#define l4e_read(l4ep) \ + l4e_from_intpte(read_atomic(&l4e_get_intpte(*(l4ep)))) /* Write a pte atomically to memory. */ #define l1e_write(l1ep, l1e) \ diff --git a/xen/arch/x86/include/asm/x86_64/page.h b/xen/arch/x86/include/asm/x86_64/page.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/x86_64/page.h +++ b/xen/arch/x86/include/asm/x86_64/page.h @@ -XXX,XX +XXX,XX @@ typedef l4_pgentry_t root_pgentry_t; #endif /* !__ASSEMBLY__ */ -#define pte_read_atomic(ptep) read_atomic(ptep) - /* Given a virtual address, get an entry offset into a linear page table. 
*/ #define l1_linear_offset(_a) (((_a) & VADDR_MASK) >> L1_PAGETABLE_SHIFT) #define l2_linear_offset(_a) (((_a) & VADDR_MASK) >> L2_PAGETABLE_SHIFT) diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ static int mod_l1_entry(l1_pgentry_t *pl1e, l1_pgentry_t nl1e, struct vcpu *pt_vcpu, struct domain *pg_dom) { bool preserve_ad = (cmd == MMU_PT_UPDATE_PRESERVE_AD); - l1_pgentry_t ol1e = l1e_read_atomic(pl1e); + l1_pgentry_t ol1e = l1e_read(pl1e); struct domain *pt_dom = pt_vcpu->domain; int rc = 0; @@ -XXX,XX +XXX,XX @@ static int mod_l2_entry(l2_pgentry_t *pl2e, return -EPERM; } - ol2e = l2e_read_atomic(pl2e); + ol2e = l2e_read(pl2e); if ( l2e_get_flags(nl2e) & _PAGE_PRESENT ) { @@ -XXX,XX +XXX,XX @@ static int mod_l3_entry(l3_pgentry_t *pl3e, if ( pgentry_ptr_to_slot(pl3e) >= 3 && is_pv_32bit_domain(d) ) return -EINVAL; - ol3e = l3e_read_atomic(pl3e); + ol3e = l3e_read(pl3e); if ( l3e_get_flags(nl3e) & _PAGE_PRESENT ) { @@ -XXX,XX +XXX,XX @@ static int mod_l4_entry(l4_pgentry_t *pl4e, return -EINVAL; } - ol4e = l4e_read_atomic(pl4e); + ol4e = l4e_read(pl4e); if ( l4e_get_flags(nl4e) & _PAGE_PRESENT ) { @@ -XXX,XX +XXX,XX @@ void init_or_livepatch modify_xen_mappings_lite( while ( v < e ) { l2_pgentry_t *pl2e = &l2_xenmap[l2_table_offset(v)]; - l2_pgentry_t l2e = l2e_read_atomic(pl2e); + l2_pgentry_t l2e = l2e_read(pl2e); unsigned int l2f = l2e_get_flags(l2e); ASSERT(l2f & _PAGE_PRESENT); @@ -XXX,XX +XXX,XX @@ void init_or_livepatch modify_xen_mappings_lite( while ( v < e ) { l1_pgentry_t *pl1e = &pl1t[l1_table_offset(v)]; - l1_pgentry_t l1e = l1e_read_atomic(pl1e); + l1_pgentry_t l1e = l1e_read(pl1e); unsigned int l1f = l1e_get_flags(l1e); ASSERT(l1f & _PAGE_PRESENT); diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/traps.c +++ b/xen/arch/x86/traps.c @@ -XXX,XX +XXX,XX @@ static enum pf_type __page_fault_type(unsigned long addr, mfn = cr3 >> PAGE_SHIFT; l4t = map_domain_page(_mfn(mfn)); - l4e = l4e_read_atomic(&l4t[l4_table_offset(addr)]); + l4e = l4e_read(&l4t[l4_table_offset(addr)]); mfn = l4e_get_pfn(l4e); unmap_domain_page(l4t); if ( ((l4e_get_flags(l4e) & required_flags) != required_flags) || @@ -XXX,XX +XXX,XX @@ static enum pf_type __page_fault_type(unsigned long addr, page_user &= l4e_get_flags(l4e); l3t = map_domain_page(_mfn(mfn)); - l3e = l3e_read_atomic(&l3t[l3_table_offset(addr)]); + l3e = l3e_read(&l3t[l3_table_offset(addr)]); mfn = l3e_get_pfn(l3e); unmap_domain_page(l3t); if ( ((l3e_get_flags(l3e) & required_flags) != required_flags) || @@ -XXX,XX +XXX,XX @@ static enum pf_type __page_fault_type(unsigned long addr, goto leaf; l2t = map_domain_page(_mfn(mfn)); - l2e = l2e_read_atomic(&l2t[l2_table_offset(addr)]); + l2e = l2e_read(&l2t[l2_table_offset(addr)]); mfn = l2e_get_pfn(l2e); unmap_domain_page(l2t); if ( ((l2e_get_flags(l2e) & required_flags) != required_flags) || @@ -XXX,XX +XXX,XX @@ static enum pf_type __page_fault_type(unsigned long addr, goto leaf; l1t = map_domain_page(_mfn(mfn)); - l1e = l1e_read_atomic(&l1t[l1_table_offset(addr)]); + l1e = l1e_read(&l1t[l1_table_offset(addr)]); mfn = l1e_get_pfn(l1e); unmap_domain_page(l1t); if ( ((l1e_get_flags(l1e) & required_flags) != required_flags) || -- 2.45.2
The PVH dom0 builder doesn't switch page tables and has no need to run with SMAP disabled. Put the SMAP disabling close to the code region where it's necessary, as it then becomes obvious why switch_cr3_cr4() is required instead of write_ptbase(). Note removing SMAP from cr4_pv32_mask is not required, as we never jump into guest context, and hence updating the value of cr4_pv32_mask is not relevant. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/pv/dom0_build.c | 13 ++++++++++--- xen/arch/x86/setup.c | 17 ----------------- 2 files changed, 10 insertions(+), 20 deletions(-) diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/dom0_build.c +++ b/xen/arch/x86/pv/dom0_build.c @@ -XXX,XX +XXX,XX @@ int __init dom0_construct_pv(struct domain *d, unsigned long alloc_epfn; unsigned long initrd_pfn = -1, initrd_mfn = 0; unsigned long count; + unsigned long cr4; struct page_info *page = NULL; unsigned int flush_flags = 0; start_info_t *si; @@ -XXX,XX +XXX,XX @@ int __init dom0_construct_pv(struct domain *d, /* Set up CR3 value for switch_cr3_cr4(). */ update_cr3(v); + /* + * Temporarily clear SMAP in CR4 to allow user-accesses when running with + * the dom0 page-tables. Cache the value of CR4 so it can be restored. + */ + cr4 = read_cr4(); + /* We run on dom0's page tables for the final part of the build process. */ - switch_cr3_cr4(cr3_pa(v->arch.cr3), read_cr4()); + switch_cr3_cr4(cr3_pa(v->arch.cr3), cr4 & ~X86_CR4_SMAP); mapcache_override_current(v); /* Copy the OS image and free temporary buffer. */ @@ -XXX,XX +XXX,XX @@ int __init dom0_construct_pv(struct domain *d, (parms.virt_hypercall >= v_end) ) { mapcache_override_current(NULL); - switch_cr3_cr4(current->arch.cr3, read_cr4()); + switch_cr3_cr4(current->arch.cr3, cr4); printk("Invalid HYPERCALL_PAGE field in ELF notes.\n"); return -EINVAL; } @@ -XXX,XX +XXX,XX @@ int __init dom0_construct_pv(struct domain *d, /* Return to idle domain's page tables. */ mapcache_override_current(NULL); - switch_cr3_cr4(current->arch.cr3, read_cr4()); + switch_cr3_cr4(current->arch.cr3, cr4); update_domain_wallclock_time(d); diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/setup.c +++ b/xen/arch/x86/setup.c @@ -XXX,XX +XXX,XX @@ static struct domain *__init create_dom0(const module_t *image, } } - /* - * Temporarily clear SMAP in CR4 to allow user-accesses in construct_dom0(). - * This saves a large number of corner cases interactions with - * copy_from_user(). - */ - if ( cpu_has_smap ) - { - cr4_pv32_mask &= ~X86_CR4_SMAP; - write_cr4(read_cr4() & ~X86_CR4_SMAP); - } - if ( construct_dom0(d, image, headroom, initrd, cmdline) != 0 ) panic("Could not construct domain 0\n"); - if ( cpu_has_smap ) - { - write_cr4(read_cr4() | X86_CR4_SMAP); - cr4_pv32_mask |= X86_CR4_SMAP; - } - return d; } -- 2.45.2
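Condensed, the pattern in the PV dom0 builder after this patch is as follows; this is a sketch assembled from the hunks above, with the actual build steps reduced to a comment:

    unsigned long cr4 = read_cr4();

    /* Run on dom0's page-tables with SMAP temporarily cleared, so that
     * user-accesses from the builder don't trip over SMAP. */
    switch_cr3_cr4(cr3_pa(v->arch.cr3), cr4 & ~X86_CR4_SMAP);

    /* ... copy the OS image, set up the start_info page, etc. ... */

    /* Return to the idle page-tables with the original CR4 (SMAP) restored. */
    switch_cr3_cr4(current->arch.cr3, cr4);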
The idle_pg_table L4 is cloned to create all the other L4s Xen uses, and hence it shouldn't be modified once further L4s are created. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/mm.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ static l3_pgentry_t *virt_to_xen_l3e(unsigned long v) mfn_t l3mfn; l3_pgentry_t *l3t = alloc_mapped_pagetable(&l3mfn); + /* + * dom0 is built at smp_boot, at which point we already create new L4s + * based on idle_pg_table. + */ + BUG_ON(system_state >= SYS_STATE_smp_boot); + if ( !l3t ) return NULL; UNMAP_DOMAIN_PAGE(l3t); -- 2.45.2
There are no callers outside the translation unit where it's defined, so make the function static. No functional change intended. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/include/asm/mm.h | 2 -- xen/arch/x86/mm.c | 2 +- 2 files changed, 1 insertion(+), 3 deletions(-) diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/mm.h +++ b/xen/arch/x86/include/asm/mm.h @@ -XXX,XX +XXX,XX @@ mfn_t alloc_xen_pagetable(void); void free_xen_pagetable(mfn_t mfn); void *alloc_mapped_pagetable(mfn_t *pmfn); -l1_pgentry_t *virt_to_xen_l1e(unsigned long v); - int __sync_local_execstate(void); /* Arch-specific portion of memory_op hypercall. */ diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ static l2_pgentry_t *virt_to_xen_l2e(unsigned long v) return map_l2t_from_l3e(l3e) + l2_table_offset(v); } -l1_pgentry_t *virt_to_xen_l1e(unsigned long v) +static l1_pgentry_t *virt_to_xen_l1e(unsigned long v) { l2_pgentry_t *pl2e, l2e; -- 2.45.2
This reduces the repeated accessing of v->domain. No functional change intended. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/mm.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ void make_cr3(struct vcpu *v, mfn_t mfn) void write_ptbase(struct vcpu *v) { + const struct domain *d = v->domain; struct cpu_info *cpu_info = get_cpu_info(); unsigned long new_cr4; - new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v)) + new_cr4 = (is_pv_domain(d) && !is_idle_domain(d)) ? pv_make_cr4(v) : mmu_cr4_features; - if ( is_pv_vcpu(v) && v->domain->arch.pv.xpti ) + if ( is_pv_domain(d) && d->arch.pv.xpti ) { cpu_info->root_pgt_changed = true; cpu_info->pv_cr3 = __pa(this_cpu(root_pgt)); -- 2.45.2
XPTI, being a speculation mitigation, is better initialized in spec_ctrl_init_domain(). No functional change intended, although the call to spec_ctrl_init_domain() in arch_domain_create() needs to be moved ahead of pv_domain_initialise() for d->arch.pv.xpti to be correctly set. Move it ahead of most of the initialization functions, since spec_ctrl_init_domain() doesn't depend on any member in the struct domain being set. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/domain.c | 4 ++-- xen/arch/x86/pv/domain.c | 2 -- xen/arch/x86/spec_ctrl.c | 4 ++++ 3 files changed, 6 insertions(+), 4 deletions(-) diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -XXX,XX +XXX,XX @@ int arch_domain_create(struct domain *d, is_pv_domain(d) ? __HYPERVISOR_COMPAT_VIRT_START : ~0u; #endif + spec_ctrl_init_domain(d); + if ( (rc = paging_domain_init(d)) != 0 ) goto fail; paging_initialised = true; @@ -XXX,XX +XXX,XX @@ int arch_domain_create(struct domain *d, d->arch.msr_relaxed = config->arch.misc_flags & XEN_X86_MSR_RELAXED; - spec_ctrl_init_domain(d); - return 0; fail: diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/domain.c +++ b/xen/arch/x86/pv/domain.c @@ -XXX,XX +XXX,XX @@ int pv_domain_initialise(struct domain *d) d->arch.ctxt_switch = &pv_csw; - d->arch.pv.xpti = is_hardware_domain(d) ? opt_xpti_hwdom : opt_xpti_domu; - if ( !is_pv_32bit_domain(d) && use_invpcid && cpu_has_pcid ) switch ( ACCESS_ONCE(opt_pcid) ) { diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/spec_ctrl.c +++ b/xen/arch/x86/spec_ctrl.c @@ -XXX,XX +XXX,XX @@ void spec_ctrl_init_domain(struct domain *d) (ibpb ? SCF_entry_ibpb : 0) | (bhb ? SCF_entry_bhb : 0) | 0; + + if ( pv ) + d->arch.pv.xpti = is_hardware_domain(d) ? opt_xpti_hwdom + : opt_xpti_domu; } void __init init_speculation_mitigations(void) -- 2.45.2
In preparation for the function being called from contexts where no domain is present. No functional change intended. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/include/asm/mm.h | 4 +++- xen/arch/x86/mm.c | 24 +++++++++++++----------- xen/arch/x86/mm/hap/hap.c | 3 ++- xen/arch/x86/mm/shadow/hvm.c | 3 ++- xen/arch/x86/mm/shadow/multi.c | 7 +++++-- xen/arch/x86/pv/dom0_build.c | 3 ++- xen/arch/x86/pv/domain.c | 3 ++- 7 files changed, 29 insertions(+), 18 deletions(-) diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/mm.h +++ b/xen/arch/x86/include/asm/mm.h @@ -XXX,XX +XXX,XX @@ int devalidate_page(struct page_info *page, unsigned long type, void init_xen_pae_l2_slots(l2_pgentry_t *l2t, const struct domain *d); void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn, - const struct domain *d, mfn_t sl4mfn, bool ro_mpt); + mfn_t sl4mfn, const struct page_info *perdomain_l3, + bool ro_mpt, bool maybe_compat, bool short_directmap); + bool fill_ro_mpt(mfn_t mfn); void zap_ro_mpt(mfn_t mfn); diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ static int promote_l3_table(struct page_info *page) * extended directmap. */ void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn, - const struct domain *d, mfn_t sl4mfn, bool ro_mpt) + mfn_t sl4mfn, const struct page_info *perdomain_l3, + bool ro_mpt, bool maybe_compat, bool short_directmap) { - /* - * PV vcpus need a shortened directmap. HVM and Idle vcpus get the full - * directmap. - */ - bool short_directmap = !paging_mode_external(d); - /* Slot 256: RO M2P (if applicable). */ l4t[l4_table_offset(RO_MPT_VIRT_START)] = ro_mpt ? idle_pg_table[l4_table_offset(RO_MPT_VIRT_START)] @@ -XXX,XX +XXX,XX @@ void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn, l4e_from_mfn(sl4mfn, __PAGE_HYPERVISOR_RW); /* Slot 260: Per-domain mappings. */ - l4t[l4_table_offset(PERDOMAIN_VIRT_START)] = - l4e_from_page(d->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW); + if ( perdomain_l3 ) + l4t[l4_table_offset(PERDOMAIN_VIRT_START)] = + l4e_from_page(perdomain_l3, __PAGE_HYPERVISOR_RW); /* Slot 4: Per-domain mappings mirror. */ BUILD_BUG_ON(IS_ENABLED(CONFIG_PV32) && !l4_table_offset(PERDOMAIN_ALT_VIRT_START)); - if ( !is_pv_64bit_domain(d) ) + if ( perdomain_l3 && maybe_compat ) l4t[l4_table_offset(PERDOMAIN_ALT_VIRT_START)] = l4t[l4_table_offset(PERDOMAIN_VIRT_START)]; @@ -XXX,XX +XXX,XX @@ void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn, else #endif { + /* + * PV vcpus need a shortened directmap. HVM and Idle vcpus get the full + * directmap. + */ unsigned int slots = (short_directmap ? 
ROOT_PAGETABLE_PV_XEN_SLOTS : ROOT_PAGETABLE_XEN_SLOTS); @@ -XXX,XX +XXX,XX @@ static int promote_l4_table(struct page_info *page) if ( !rc ) { init_xen_l4_slots(pl4e, l4mfn, - d, INVALID_MFN, VM_ASSIST(d, m2p_strict)); + INVALID_MFN, d->arch.perdomain_l3_pg, + VM_ASSIST(d, m2p_strict), !is_pv_64bit_domain(d), + true); atomic_inc(&d->arch.pv.nr_l4_pages); } unmap_domain_page(pl4e); diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm/hap/hap.c +++ b/xen/arch/x86/mm/hap/hap.c @@ -XXX,XX +XXX,XX @@ static mfn_t hap_make_monitor_table(struct vcpu *v) m4mfn = page_to_mfn(pg); l4e = map_domain_page(m4mfn); - init_xen_l4_slots(l4e, m4mfn, d, INVALID_MFN, false); + init_xen_l4_slots(l4e, m4mfn, INVALID_MFN, d->arch.perdomain_l3_pg, + false, true, false); unmap_domain_page(l4e); return m4mfn; diff --git a/xen/arch/x86/mm/shadow/hvm.c b/xen/arch/x86/mm/shadow/hvm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm/shadow/hvm.c +++ b/xen/arch/x86/mm/shadow/hvm.c @@ -XXX,XX +XXX,XX @@ mfn_t sh_make_monitor_table(const struct vcpu *v, unsigned int shadow_levels) * shadow-linear mapping will either be inserted below when creating * lower level monitor tables, or later in sh_update_cr3(). */ - init_xen_l4_slots(l4e, m4mfn, d, INVALID_MFN, false); + init_xen_l4_slots(l4e, m4mfn, INVALID_MFN, d->arch.perdomain_l3_pg, + false, true, false); if ( shadow_levels < 4 ) { diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm/shadow/multi.c +++ b/xen/arch/x86/mm/shadow/multi.c @@ -XXX,XX +XXX,XX @@ sh_make_shadow(struct vcpu *v, mfn_t gmfn, u32 shadow_type) BUILD_BUG_ON(sizeof(l4_pgentry_t) != sizeof(shadow_l4e_t)); - init_xen_l4_slots(l4t, gmfn, d, smfn, (!is_pv_32bit_domain(d) && - VM_ASSIST(d, m2p_strict))); + init_xen_l4_slots(l4t, gmfn, smfn, + d->arch.perdomain_l3_pg, + (!is_pv_32bit_domain(d) && + VM_ASSIST(d, m2p_strict)), + !is_pv_64bit_domain(d), true); unmap_domain_page(l4t); } break; diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/dom0_build.c +++ b/xen/arch/x86/pv/dom0_build.c @@ -XXX,XX +XXX,XX @@ int __init dom0_construct_pv(struct domain *d, l4start = l4tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE; clear_page(l4tab); init_xen_l4_slots(l4tab, _mfn(virt_to_mfn(l4start)), - d, INVALID_MFN, true); + INVALID_MFN, d->arch.perdomain_l3_pg, + true, !is_pv_64bit_domain(d), true); v->arch.guest_table = pagetable_from_paddr(__pa(l4start)); } else diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/domain.c +++ b/xen/arch/x86/pv/domain.c @@ -XXX,XX +XXX,XX @@ static int setup_compat_l4(struct vcpu *v) mfn = page_to_mfn(pg); l4tab = map_domain_page(mfn); clear_page(l4tab); - init_xen_l4_slots(l4tab, mfn, v->domain, INVALID_MFN, false); + init_xen_l4_slots(l4tab, mfn, INVALID_MFN, v->domain->arch.perdomain_l3_pg, + false, true, true); unmap_domain_page(l4tab); /* This page needs to look like a pagetable so that it can be shadowed */ -- 2.45.2
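To summarise the interface change, this is how the previous domain-based call maps onto the new prototype; the argument expressions are the ones the hunks above derive from the domain, shown here side by side for illustration only:

    /* Before: the domain pointer carried all the information implicitly. */
    init_xen_l4_slots(l4t, l4mfn, d, sl4mfn, ro_mpt);

    /* After: callers pass exactly the bits that used to be derived from d. */
    init_xen_l4_slots(l4t, l4mfn, sl4mfn,
                      d->arch.perdomain_l3_pg,   /* per-domain slot, may be NULL */
                      ro_mpt,                    /* populate the RO M2P slot */
                      !is_pv_64bit_domain(d),    /* mirror the compat per-domain slot */
                      !paging_mode_external(d)); /* shortened directmap for PV */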
The current logic gates issuing flush TLB requests with the FLUSH_ROOT_PGTBL flag to XPTI being enabled. In preparation for FLUSH_ROOT_PGTBL also being needed when not using XPTI, untie it from the xpti domain boolean and instead introduce a new flush_root_pt field. No functional change intended, as flush_root_pt == xpti. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/include/asm/domain.h | 2 ++ xen/arch/x86/include/asm/flushtlb.h | 2 +- xen/arch/x86/mm.c | 2 +- xen/arch/x86/pv/domain.c | 2 ++ 4 files changed, 6 insertions(+), 2 deletions(-) diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/domain.h +++ b/xen/arch/x86/include/asm/domain.h @@ -XXX,XX +XXX,XX @@ struct pv_domain bool pcid; /* Mitigate L1TF with shadow/crashing? */ bool check_l1tf; + /* Issue FLUSH_ROOT_PGTBL for root page-table changes. */ + bool flush_root_pt; /* map_domain_page() mapping cache. */ struct mapcache_domain mapcache; diff --git a/xen/arch/x86/include/asm/flushtlb.h b/xen/arch/x86/include/asm/flushtlb.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/flushtlb.h +++ b/xen/arch/x86/include/asm/flushtlb.h @@ -XXX,XX +XXX,XX @@ void flush_area_mask(const cpumask_t *mask, const void *va, #define flush_root_pgtbl_domain(d) \ { \ - if ( is_pv_domain(d) && (d)->arch.pv.xpti ) \ + if ( is_pv_domain(d) && (d)->arch.pv.flush_root_pt ) \ flush_mask((d)->dirty_cpumask, FLUSH_ROOT_PGTBL); \ } diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ long do_mmu_update( cmd == MMU_PT_UPDATE_PRESERVE_AD, v); if ( !rc ) flush_linear_pt = true; - if ( !rc && pt_owner->arch.pv.xpti ) + if ( !rc && pt_owner->arch.pv.flush_root_pt ) { bool local_in_use = false; diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/domain.c +++ b/xen/arch/x86/pv/domain.c @@ -XXX,XX +XXX,XX @@ int pv_domain_initialise(struct domain *d) d->arch.ctxt_switch = &pv_csw; + d->arch.pv.flush_root_pt = d->arch.pv.xpti; + if ( !is_pv_32bit_domain(d) && use_invpcid && cpu_has_pcid ) switch ( ACCESS_ONCE(opt_pcid) ) { -- 2.45.2
Move the handling of FLUSH_ROOT_PGTBL in flush_area_local() ahead of the logic that does the TLB flushing, in preparation for further changes requiring the TLB flush to be strictly done after having handled FLUSH_ROOT_PGTBL. No functional change intended. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/flushtlb.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/flushtlb.c +++ b/xen/arch/x86/flushtlb.c @@ -XXX,XX +XXX,XX @@ unsigned int flush_area_local(const void *va, unsigned int flags) { unsigned int order = (flags - 1) & FLUSH_ORDER_MASK; + if ( flags & FLUSH_ROOT_PGTBL ) + get_cpu_info()->root_pgt_changed = true; + if ( flags & (FLUSH_TLB|FLUSH_TLB_GLOBAL) ) { if ( order == 0 ) @@ -XXX,XX +XXX,XX @@ unsigned int flush_area_local(const void *va, unsigned int flags) } } - if ( flags & FLUSH_ROOT_PGTBL ) - get_cpu_info()->root_pgt_changed = true; - return flags; } -- 2.45.2
It's currently only used for XPTI. Move the code to a separate helper in preparation for it gaining more logic. While there switch to using l4e_write(): in the current context the L4 is not active when modified, but that could change. No functional change intended. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/domain.c | 4 +--- xen/arch/x86/include/asm/mm.h | 3 +++ xen/arch/x86/mm.c | 7 +++++++ 3 files changed, 11 insertions(+), 3 deletions(-) diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -XXX,XX +XXX,XX @@ void cf_check paravirt_ctxt_switch_to(struct vcpu *v) root_pgentry_t *root_pgt = this_cpu(root_pgt); if ( root_pgt ) - root_pgt[root_table_offset(PERDOMAIN_VIRT_START)] = - l4e_from_page(v->domain->arch.perdomain_l3_pg, - __PAGE_HYPERVISOR_RW); + setup_perdomain_slot(v, root_pgt); if ( unlikely(v->arch.dr7 & DR7_ACTIVE_MASK) ) activate_debugregs(v); diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/mm.h +++ b/xen/arch/x86/include/asm/mm.h @@ -XXX,XX +XXX,XX @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr) return (mfn + nr) <= (virt_to_mfn(eva - 1) + 1); } +/* Setup the per-domain slot in the root page table pointer. */ +void setup_perdomain_slot(const struct vcpu *v, root_pgentry_t *root_pgt); + #endif /* __ASM_X86_MM_H__ */ diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ unsigned long get_upper_mfn_bound(void) return min(max_mfn, 1UL << (paddr_bits - PAGE_SHIFT)) - 1; } +void setup_perdomain_slot(const struct vcpu *v, root_pgentry_t *root_pgt) +{ + l4e_write(&root_pgt[root_table_offset(PERDOMAIN_VIRT_START)], + l4e_from_page(v->domain->arch.perdomain_l3_pg, + __PAGE_HYPERVISOR_RW)); +} + static void __init __maybe_unused build_assertions(void) { /* -- 2.45.2
No functional change, as the option is not used. It is introduced now so that newly added functionality can be keyed on the option being enabled, even if the feature is not yet functional. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- docs/misc/xen-command-line.pandoc | 15 ++++-- xen/arch/x86/include/asm/domain.h | 3 ++ xen/arch/x86/include/asm/spec_ctrl.h | 2 + xen/arch/x86/spec_ctrl.c | 74 +++++++++++++++++++++++++--- 4 files changed, 81 insertions(+), 13 deletions(-) diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc index XXXXXXX..XXXXXXX 100644 --- a/docs/misc/xen-command-line.pandoc +++ b/docs/misc/xen-command-line.pandoc @@ -XXX,XX +XXX,XX @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`). ### spec-ctrl (x86) > `= List of [ <bool>, xen=<bool>, {pv,hvm}=<bool>, -> {msr-sc,rsb,verw,{ibpb,bhb}-entry}=<bool>|{pv,hvm}=<bool>, +> {msr-sc,rsb,verw,{ibpb,bhb}-entry,asi}=<bool>|{pv,hvm}=<bool>, > bti-thunk=retpoline|lfence|jmp,bhb-seq=short|tsx|long, > {ibrs,ibpb,ssbd,psfd, > eager-fpu,l1d-flush,branch-harden,srb-lock, @@ -XXX,XX +XXX,XX @@ in place for guests to use. Use of a positive boolean value for either of these options is invalid. -The `pv=`, `hvm=`, `msr-sc=`, `rsb=`, `verw=`, `ibpb-entry=` and `bhb-entry=` -options offer fine grained control over the primitives by Xen. These impact -Xen's ability to protect itself, and/or Xen's ability to virtualise support -for guests to use. +The `pv=`, `hvm=`, `msr-sc=`, `rsb=`, `verw=`, `ibpb-entry=`, `bhb-entry=` and +`asi=` options offer fine grained control over the primitives by Xen. These +impact Xen's ability to protect itself, and/or Xen's ability to virtualise +support for guests to use. * `pv=` and `hvm=` offer control over all suboptions for PV and HVM guests respectively. @@ -XXX,XX +XXX,XX @@ for guests to use. is not available (see `bhi-dis-s`). The choice of scrubbing sequence can be selected using the `bhb-seq=` option. If it is necessary to protect dom0 too, boot with `spec-ctrl=bhb-entry`. +* `asi=` offers control over whether the hypervisor will engage in Address + Space Isolation, by not having sensitive information mapped in the VMM + page-tables. Not having sensitive information on the page-tables avoids + having to perform some mitigations for speculative attacks when + context-switching to the hypervisor. If Xen was compiled with `CONFIG_INDIRECT_THUNK` support, `bti-thunk=` can be used to select which of the thunks gets patched into the diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/domain.h +++ b/xen/arch/x86/include/asm/domain.h @@ -XXX,XX +XXX,XX @@ struct arch_domain /* Don't unconditionally inject #GP for unhandled MSRs. */ bool msr_relaxed; + /* Run the guest without sensitive information in the VMM page-tables. */ + bool asi; + /* Emulated devices enabled bitmap.
*/ uint32_t emulation_flags; } __cacheline_aligned; diff --git a/xen/arch/x86/include/asm/spec_ctrl.h b/xen/arch/x86/include/asm/spec_ctrl.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/spec_ctrl.h +++ b/xen/arch/x86/include/asm/spec_ctrl.h @@ -XXX,XX +XXX,XX @@ extern uint8_t default_scf; extern int8_t opt_xpti_hwdom, opt_xpti_domu; +extern int8_t opt_asi_pv, opt_asi_hwdom, opt_asi_hvm; + extern bool cpu_has_bug_l1tf; extern int8_t opt_pv_l1tf_hwdom, opt_pv_l1tf_domu; diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/spec_ctrl.c +++ b/xen/arch/x86/spec_ctrl.c @@ -XXX,XX +XXX,XX @@ static bool __ro_after_init opt_verw_mmio; static int8_t __initdata opt_gds_mit = -1; static int8_t __initdata opt_div_scrub = -1; +/* Address Space Isolation for PV/HVM. */ +int8_t __ro_after_init opt_asi_pv = -1; +int8_t __ro_after_init opt_asi_hwdom = -1; +int8_t __ro_after_init opt_asi_hvm = -1; + static int __init cf_check parse_spec_ctrl(const char *s) { const char *ss; @@ -XXX,XX +XXX,XX @@ static int __init cf_check parse_spec_ctrl(const char *s) opt_unpriv_mmio = false; opt_gds_mit = 0; opt_div_scrub = 0; + + opt_asi_pv = 0; + opt_asi_hwdom = 0; + opt_asi_hvm = 0; } else if ( val > 0 ) rc = -EINVAL; @@ -XXX,XX +XXX,XX @@ static int __init cf_check parse_spec_ctrl(const char *s) opt_verw_pv = val; opt_ibpb_entry_pv = val; opt_bhb_entry_pv = val; + opt_asi_pv = val; } else if ( (val = parse_boolean("hvm", s, ss)) >= 0 ) { @@ -XXX,XX +XXX,XX @@ static int __init cf_check parse_spec_ctrl(const char *s) opt_verw_hvm = val; opt_ibpb_entry_hvm = val; opt_bhb_entry_hvm = val; + opt_asi_hvm = val; } else if ( (val = parse_boolean("msr-sc", s, ss)) != -1 ) { @@ -XXX,XX +XXX,XX @@ static int __init cf_check parse_spec_ctrl(const char *s) break; } } + else if ( (val = parse_boolean("asi", s, ss)) != -1 ) + { + switch ( val ) + { + case 0: + case 1: + opt_asi_pv = opt_asi_hwdom = opt_asi_hvm = val; + break; + + case -2: + s += strlen("asi="); + if ( (val = parse_boolean("pv", s, ss)) >= 0 ) + opt_asi_pv = val; + else if ( (val = parse_boolean("hvm", s, ss)) >= 0 ) + opt_asi_hvm = val; + else + default: + rc = -EINVAL; + break; + } + } /* Xen's speculative sidechannel mitigation settings. */ else if ( !strncmp(s, "bti-thunk=", 10) ) @@ -XXX,XX +XXX,XX @@ int8_t __ro_after_init opt_xpti_domu = -1; static __init void xpti_init_default(void) { + ASSERT(opt_asi_pv >= 0 && opt_asi_hwdom >= 0); + if ( (opt_xpti_hwdom == 1 || opt_xpti_domu == 1) && opt_asi_pv == 1 ) + { + printk(XENLOG_ERR + "XPTI is incompatible with Address Space Isolation - disabling ASI\n"); + opt_asi_pv = 0; + } if ( (boot_cpu_data.x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)) || cpu_has_rdcl_no ) { @@ -XXX,XX +XXX,XX @@ static __init void xpti_init_default(void) else { if ( opt_xpti_hwdom < 0 ) - opt_xpti_hwdom = 1; + opt_xpti_hwdom = !opt_asi_hwdom; if ( opt_xpti_domu < 0 ) - opt_xpti_domu = 1; + opt_xpti_domu = !opt_asi_pv; } } @@ -XXX,XX +XXX,XX @@ static void __init print_details(enum ind_thunk thunk) * mitigation support for guests. */ #ifdef CONFIG_HVM - printk(" Support for HVM VMs:%s%s%s%s%s%s%s%s\n", + printk(" Support for HVM VMs:%s%s%s%s%s%s%s%s%s\n", (boot_cpu_has(X86_FEATURE_SC_MSR_HVM) || boot_cpu_has(X86_FEATURE_SC_RSB_HVM) || boot_cpu_has(X86_FEATURE_IBPB_ENTRY_HVM) || opt_bhb_entry_hvm || amd_virt_spec_ctrl || - opt_eager_fpu || opt_verw_hvm) ? "" : " None", + opt_eager_fpu || opt_verw_hvm || + opt_asi_hvm) ? 
"" : " None", boot_cpu_has(X86_FEATURE_SC_MSR_HVM) ? " MSR_SPEC_CTRL" : "", (boot_cpu_has(X86_FEATURE_SC_MSR_HVM) || amd_virt_spec_ctrl) ? " MSR_VIRT_SPEC_CTRL" : "", @@ -XXX,XX +XXX,XX @@ static void __init print_details(enum ind_thunk thunk) opt_eager_fpu ? " EAGER_FPU" : "", opt_verw_hvm ? " VERW" : "", boot_cpu_has(X86_FEATURE_IBPB_ENTRY_HVM) ? " IBPB-entry" : "", - opt_bhb_entry_hvm ? " BHB-entry" : ""); + opt_bhb_entry_hvm ? " BHB-entry" : "", + opt_asi_hvm ? " ASI" : ""); #endif #ifdef CONFIG_PV - printk(" Support for PV VMs:%s%s%s%s%s%s%s\n", + printk(" Support for PV VMs:%s%s%s%s%s%s%s%s\n", (boot_cpu_has(X86_FEATURE_SC_MSR_PV) || boot_cpu_has(X86_FEATURE_SC_RSB_PV) || boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) || - opt_bhb_entry_pv || + opt_bhb_entry_pv || opt_asi_pv || opt_eager_fpu || opt_verw_pv) ? "" : " None", boot_cpu_has(X86_FEATURE_SC_MSR_PV) ? " MSR_SPEC_CTRL" : "", boot_cpu_has(X86_FEATURE_SC_RSB_PV) ? " RSB" : "", opt_eager_fpu ? " EAGER_FPU" : "", opt_verw_pv ? " VERW" : "", boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) ? " IBPB-entry" : "", - opt_bhb_entry_pv ? " BHB-entry" : ""); + opt_bhb_entry_pv ? " BHB-entry" : "", + opt_asi_pv ? " ASI" : ""); printk(" XPTI (64-bit PV only): Dom0 %s, DomU %s (with%s PCID)\n", opt_xpti_hwdom ? "enabled" : "disabled", @@ -XXX,XX +XXX,XX @@ void spec_ctrl_init_domain(struct domain *d) if ( pv ) d->arch.pv.xpti = is_hardware_domain(d) ? opt_xpti_hwdom : opt_xpti_domu; + + d->arch.asi = is_hardware_domain(d) ? opt_asi_hwdom + : pv ? opt_asi_pv : opt_asi_hvm; } void __init init_speculation_mitigations(void) @@ -XXX,XX +XXX,XX @@ void __init init_speculation_mitigations(void) hw_smt_enabled && default_xen_spec_ctrl ) setup_force_cpu_cap(X86_FEATURE_SC_MSR_IDLE); + /* Disable ASI by default until feature is finished. */ + if ( opt_asi_pv == -1 ) + opt_asi_pv = 0; + if ( opt_asi_hwdom == -1 ) + opt_asi_hwdom = 0; + if ( opt_asi_hvm == -1 ) + opt_asi_hvm = 0; + + if ( opt_asi_pv || opt_asi_hvm ) + warning_add( + "Address Space Isolation is not functional, this option is\n" + "intended to be used only for development purposes.\n"); + xpti_init_default(); l1tf_calculations(); -- 2.45.2
Instead of allocating a monitor table for each vCPU when running in HVM HAP mode, use a per-pCPU monitor table, which gets the per-domain slot updated on guest context switch. This limits the amount of memory used for HVM HAP monitor tables to the amount of active pCPUs, rather than to the number of vCPUs. It also simplifies vCPU allocation and teardown, since the monitor table handling is removed from there. Note the switch to using a per-CPU monitor table is done regardless of whether Address Space Isolation is enabled or not. Partly for the memory usage reduction, and also because it allows to simplify the VM tear down path by not having to cleanup the per-vCPU monitor tables. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- Note the monitor table is not made static because uses outside of the file where it's defined will be added by further patches. --- xen/arch/x86/hvm/hvm.c | 60 ++++++++++++++++++++++++ xen/arch/x86/hvm/svm/svm.c | 5 ++ xen/arch/x86/hvm/vmx/vmcs.c | 1 + xen/arch/x86/hvm/vmx/vmx.c | 4 ++ xen/arch/x86/include/asm/hap.h | 1 - xen/arch/x86/include/asm/hvm/hvm.h | 8 ++++ xen/arch/x86/mm.c | 8 ++++ xen/arch/x86/mm/hap/hap.c | 75 ------------------------------ xen/arch/x86/mm/paging.c | 4 +- 9 files changed, 87 insertions(+), 79 deletions(-) diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -XXX,XX +XXX,XX @@ static const char __initconst warning_hvm_fep[] = static bool __initdata opt_altp2m_enabled; boolean_param("altp2m", opt_altp2m_enabled); +DEFINE_PER_CPU(root_pgentry_t *, monitor_pgt); + +static int allocate_cpu_monitor_table(unsigned int cpu) +{ + root_pgentry_t *pgt = alloc_xenheap_page(); + + if ( !pgt ) + return -ENOMEM; + + clear_page(pgt); + + init_xen_l4_slots(pgt, _mfn(virt_to_mfn(pgt)), INVALID_MFN, NULL, + false, true, false); + + ASSERT(!per_cpu(monitor_pgt, cpu)); + per_cpu(monitor_pgt, cpu) = pgt; + + return 0; +} + +static void free_cpu_monitor_table(unsigned int cpu) +{ + root_pgentry_t *pgt = per_cpu(monitor_pgt, cpu); + + if ( !pgt ) + return; + + per_cpu(monitor_pgt, cpu) = NULL; + free_xenheap_page(pgt); +} + +void hvm_set_cpu_monitor_table(struct vcpu *v) +{ + root_pgentry_t *pgt = this_cpu(monitor_pgt); + + ASSERT(pgt); + + setup_perdomain_slot(v, pgt); + + make_cr3(v, _mfn(virt_to_mfn(pgt))); +} + +void hvm_clear_cpu_monitor_table(struct vcpu *v) +{ + /* Poison %cr3, it will be updated when the vCPU is scheduled. 
*/ + make_cr3(v, INVALID_MFN); +} + static int cf_check cpu_callback( struct notifier_block *nfb, unsigned long action, void *hcpu) { @@ -XXX,XX +XXX,XX @@ static int cf_check cpu_callback( switch ( action ) { case CPU_UP_PREPARE: + rc = allocate_cpu_monitor_table(cpu); + if ( rc ) + break; rc = alternative_call(hvm_funcs.cpu_up_prepare, cpu); break; case CPU_DYING: @@ -XXX,XX +XXX,XX @@ static int cf_check cpu_callback( case CPU_UP_CANCELED: case CPU_DEAD: alternative_vcall(hvm_funcs.cpu_dead, cpu); + free_cpu_monitor_table(cpu); break; default: break; @@ -XXX,XX +XXX,XX @@ static bool __init hap_supported(struct hvm_function_table *fns) static int __init cf_check hvm_enable(void) { const struct hvm_function_table *fns = NULL; + int rc; if ( cpu_has_vmx ) fns = start_vmx(); @@ -XXX,XX +XXX,XX @@ static int __init cf_check hvm_enable(void) register_cpu_notifier(&cpu_nfb); + rc = allocate_cpu_monitor_table(0); + if ( rc ) + { + printk(XENLOG_ERR "Error %d setting up HVM monitor page tables\n", rc); + return rc; + } + return 0; } presmp_initcall(hvm_enable); diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/hvm/svm/svm.c +++ b/xen/arch/x86/hvm/svm/svm.c @@ -XXX,XX +XXX,XX @@ static void cf_check svm_ctxt_switch_from(struct vcpu *v) if ( unlikely((read_efer() & EFER_SVME) == 0) ) return; + hvm_clear_cpu_monitor_table(v); + if ( !v->arch.fully_eager_fpu ) svm_fpu_leave(v); @@ -XXX,XX +XXX,XX @@ static void cf_check svm_ctxt_switch_to(struct vcpu *v) ASSERT(v->domain->arch.cpuid->extd.virt_ssbd); amd_set_legacy_ssbd(true); } + + hvm_set_cpu_monitor_table(v); } static void noreturn cf_check svm_do_resume(void) @@ -XXX,XX +XXX,XX @@ static void noreturn cf_check svm_do_resume(void) hvm_migrate_pirqs(v); /* Migrating to another ASID domain. Request a new ASID. 
*/ hvm_asid_flush_vcpu(v); + hvm_update_host_cr3(v); } if ( !vcpu_guestmode && !vlapic_hw_disabled(vlapic) ) diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/hvm/vmx/vmcs.c +++ b/xen/arch/x86/hvm/vmx/vmcs.c @@ -XXX,XX +XXX,XX @@ void cf_check vmx_do_resume(void) v->arch.hvm.vmx.hostenv_migrated = 1; hvm_asid_flush_vcpu(v); + hvm_update_host_cr3(v); } debug_state = v->domain->debugger_attached diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/hvm/vmx/vmx.c +++ b/xen/arch/x86/hvm/vmx/vmx.c @@ -XXX,XX +XXX,XX @@ static void cf_check vmx_ctxt_switch_from(struct vcpu *v) if ( unlikely(!this_cpu(vmxon)) ) return; + hvm_clear_cpu_monitor_table(v); + if ( !v->is_running ) { /* @@ -XXX,XX +XXX,XX @@ static void cf_check vmx_ctxt_switch_to(struct vcpu *v) if ( v->domain->arch.hvm.pi_ops.flags & PI_CSW_TO ) vmx_pi_switch_to(v); + + hvm_set_cpu_monitor_table(v); } diff --git a/xen/arch/x86/include/asm/hap.h b/xen/arch/x86/include/asm/hap.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/hap.h +++ b/xen/arch/x86/include/asm/hap.h @@ -XXX,XX +XXX,XX @@ int hap_domctl(struct domain *d, struct xen_domctl_shadow_op *sc, XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl); int hap_enable(struct domain *d, u32 mode); void hap_final_teardown(struct domain *d); -void hap_vcpu_teardown(struct vcpu *v); void hap_teardown(struct domain *d, bool *preempted); void hap_vcpu_init(struct vcpu *v); int hap_track_dirty_vram(struct domain *d, diff --git a/xen/arch/x86/include/asm/hvm/hvm.h b/xen/arch/x86/include/asm/hvm/hvm.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/hvm/hvm.h +++ b/xen/arch/x86/include/asm/hvm/hvm.h @@ -XXX,XX +XXX,XX @@ static inline void hvm_invlpg(struct vcpu *v, unsigned long linear) (1U << X86_EXC_AC) | \ (1U << X86_EXC_MC)) +/* + * Setup the per-domain slots of the per-cpu monitor table and update the vCPU + * cr3 to use it. + */ +DECLARE_PER_CPU(root_pgentry_t *, monitor_pgt); +void hvm_set_cpu_monitor_table(struct vcpu *v); +void hvm_clear_cpu_monitor_table(struct vcpu *v); + /* Called in boot/resume paths. Must cope with no HVM support. */ static inline int hvm_cpu_up(void) { diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ void setup_perdomain_slot(const struct vcpu *v, root_pgentry_t *root_pgt) l4e_write(&root_pgt[root_table_offset(PERDOMAIN_VIRT_START)], l4e_from_page(v->domain->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW)); + + if ( !is_pv_64bit_vcpu(v) ) + /* + * HVM guests always have the compatibility L4 per-domain area because + * bitness is not know, and can change at runtime. 
+ */ + l4e_write(&root_pgt[root_table_offset(PERDOMAIN_ALT_VIRT_START)], + root_pgt[root_table_offset(PERDOMAIN_VIRT_START)]); } static void __init __maybe_unused build_assertions(void) diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm/hap/hap.c +++ b/xen/arch/x86/mm/hap/hap.c @@ -XXX,XX +XXX,XX @@ int hap_set_allocation(struct domain *d, unsigned int pages, bool *preempted) return 0; } -static mfn_t hap_make_monitor_table(struct vcpu *v) -{ - struct domain *d = v->domain; - struct page_info *pg; - l4_pgentry_t *l4e; - mfn_t m4mfn; - - ASSERT(pagetable_get_pfn(v->arch.hvm.monitor_table) == 0); - - if ( (pg = hap_alloc(d)) == NULL ) - goto oom; - - m4mfn = page_to_mfn(pg); - l4e = map_domain_page(m4mfn); - - init_xen_l4_slots(l4e, m4mfn, INVALID_MFN, d->arch.perdomain_l3_pg, - false, true, false); - unmap_domain_page(l4e); - - return m4mfn; - - oom: - if ( !d->is_dying && - (!d->is_shutting_down || d->shutdown_code != SHUTDOWN_crash) ) - { - printk(XENLOG_G_ERR "%pd: out of memory building monitor pagetable\n", - d); - domain_crash(d); - } - return INVALID_MFN; -} - -static void hap_destroy_monitor_table(struct vcpu* v, mfn_t mmfn) -{ - struct domain *d = v->domain; - - /* Put the memory back in the pool */ - hap_free(d, mmfn); -} - /************************************************/ /* HAP DOMAIN LEVEL FUNCTIONS */ /************************************************/ @@ -XXX,XX +XXX,XX @@ void hap_final_teardown(struct domain *d) } } -void hap_vcpu_teardown(struct vcpu *v) -{ - struct domain *d = v->domain; - mfn_t mfn; - - paging_lock(d); - - if ( !paging_mode_hap(d) || !v->arch.paging.mode ) - goto out; - - mfn = pagetable_get_mfn(v->arch.hvm.monitor_table); - if ( mfn_x(mfn) ) - hap_destroy_monitor_table(v, mfn); - v->arch.hvm.monitor_table = pagetable_null(); - - out: - paging_unlock(d); -} - void hap_teardown(struct domain *d, bool *preempted) { struct vcpu *v; @@ -XXX,XX +XXX,XX @@ void hap_teardown(struct domain *d, bool *preempted) ASSERT(d->is_dying); ASSERT(d != current->domain); - /* TODO - Remove when the teardown path is better structured. */ - for_each_vcpu ( d, v ) - hap_vcpu_teardown(v); - /* Leave the root pt in case we get further attempts to modify the p2m. */ if ( hvm_altp2m_supported() ) { @@ -XXX,XX +XXX,XX @@ static void cf_check hap_update_paging_modes(struct vcpu *v) v->arch.paging.mode = hap_paging_get_mode(v); - if ( pagetable_is_null(v->arch.hvm.monitor_table) ) - { - mfn_t mmfn = hap_make_monitor_table(v); - - if ( mfn_eq(mmfn, INVALID_MFN) ) - goto unlock; - v->arch.hvm.monitor_table = pagetable_from_mfn(mmfn); - make_cr3(v, mmfn); - hvm_update_host_cr3(v); - } - /* CR3 is effectively updated by a mode change. Flush ASIDs, etc. */ hap_update_cr3(v, false); - unlock: paging_unlock(d); put_gfn(d, cr3_gfn); } diff --git a/xen/arch/x86/mm/paging.c b/xen/arch/x86/mm/paging.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm/paging.c +++ b/xen/arch/x86/mm/paging.c @@ -XXX,XX +XXX,XX @@ long do_paging_domctl_cont( void paging_vcpu_teardown(struct vcpu *v) { - if ( hap_enabled(v->domain) ) - hap_vcpu_teardown(v); - else + if ( !hap_enabled(v->domain) ) shadow_vcpu_teardown(v); } -- 2.45.2
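To recap, the life cycle of the per-pCPU monitor table introduced here looks roughly as follows; this is a condensed call-flow summary of the hunks above, not verbatim code:

    /* CPU bring-up: allocate and initialise the per-pCPU monitor L4. */
    allocate_cpu_monitor_table(cpu);   /* hvm_enable() for the BSP, CPU_UP_PREPARE otherwise */

    /* Switch to an HVM vCPU: refresh the per-domain slot(s) of the local
     * monitor table and use it as the host %cr3 for that vCPU. */
    hvm_set_cpu_monitor_table(v);      /* svm/vmx ctxt_switch_to() */

    /* Switch away: poison the cached cr3 so a stale table is never reused. */
    hvm_clear_cpu_monitor_table(v);    /* svm/vmx ctxt_switch_from() */

    /* CPU teardown: free the per-pCPU table. */
    free_cpu_monitor_table(cpu);       /* CPU_UP_CANCELED / CPU_DEAD */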
Instead of allocating a monitor table for each vCPU when running in HVM shadow mode, use a per-pCPU monitor table, which gets the per-domain slot updated on guest context switch. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- I've tested this manually, but XenServer builds disable shadow support, so it possibly hasn't been given the same level of testing as the rest of the changes. --- xen/arch/x86/hvm/hvm.c | 7 +++ xen/arch/x86/include/asm/hvm/vcpu.h | 6 ++- xen/arch/x86/include/asm/paging.h | 18 ++++++++ xen/arch/x86/mm.c | 6 +++ xen/arch/x86/mm/shadow/common.c | 42 +++++++----------- xen/arch/x86/mm/shadow/hvm.c | 65 ++++++++++++---------------- xen/arch/x86/mm/shadow/multi.c | 66 ++++++++++++++++++----------- xen/arch/x86/mm/shadow/private.h | 4 +- 8 files changed, 120 insertions(+), 94 deletions(-) diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -XXX,XX +XXX,XX @@ void hvm_set_cpu_monitor_table(struct vcpu *v) ASSERT(pgt); + paging_set_cpu_monitor_table(v); setup_perdomain_slot(v, pgt); make_cr3(v, _mfn(virt_to_mfn(pgt))); @@ -XXX,XX +XXX,XX @@ void hvm_clear_cpu_monitor_table(struct vcpu *v) { /* Poison %cr3, it will be updated when the vCPU is scheduled. */ make_cr3(v, INVALID_MFN); + + paging_clear_cpu_monitor_table(v); } static int cf_check cpu_callback( @@ -XXX,XX +XXX,XX @@ int hvm_vcpu_initialise(struct vcpu *v) int rc; struct domain *d = v->domain; +#ifdef CONFIG_SHADOW_PAGING + v->arch.hvm.shadow_linear_l3 = INVALID_MFN; +#endif + hvm_asid_flush_vcpu(v); spin_lock_init(&v->arch.hvm.tm_lock); diff --git a/xen/arch/x86/include/asm/hvm/vcpu.h b/xen/arch/x86/include/asm/hvm/vcpu.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/hvm/vcpu.h +++ b/xen/arch/x86/include/asm/hvm/vcpu.h @@ -XXX,XX +XXX,XX @@ struct hvm_vcpu { uint16_t p2midx; } fast_single_step; - /* (MFN) hypervisor page table */ - pagetable_t monitor_table; +#ifdef CONFIG_SHADOW_PAGING + /* Reference to the linear L3 page table. */ + mfn_t shadow_linear_l3; +#endif struct hvm_vcpu_asid n1asid; diff --git a/xen/arch/x86/include/asm/paging.h b/xen/arch/x86/include/asm/paging.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/paging.h +++ b/xen/arch/x86/include/asm/paging.h @@ -XXX,XX +XXX,XX @@ struct paging_mode { unsigned long cr3, paddr_t ga, uint32_t *pfec, unsigned int *page_order); + void (*set_cpu_monitor_table )(struct vcpu *v); + void (*clear_cpu_monitor_table)(struct vcpu *v); #endif pagetable_t (*update_cr3 )(struct vcpu *v, bool noflush); @@ -XXX,XX +XXX,XX @@ static inline bool paging_flush_tlb(const unsigned long *vcpu_bitmap) return current->domain->arch.paging.flush_tlb(vcpu_bitmap); } +static inline void paging_set_cpu_monitor_table(struct vcpu *v) +{ + const struct paging_mode *mode = paging_get_hostmode(v); + + if ( mode->set_cpu_monitor_table ) + mode->set_cpu_monitor_table(v); +} + +static inline void paging_clear_cpu_monitor_table(struct vcpu *v) +{ + const struct paging_mode *mode = paging_get_hostmode(v); + + if ( mode->clear_cpu_monitor_table ) + mode->clear_cpu_monitor_table(v); +} + #endif /* CONFIG_HVM */ /* Update all the things that are derived from the guest's CR3. 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ void write_ptbase(struct vcpu *v) } else { + ASSERT(!is_hvm_domain(d) || !d->arch.asi +#ifdef CONFIG_HVM + || mfn_eq(maddr_to_mfn(v->arch.cr3), + virt_to_mfn(this_cpu(monitor_pgt))) +#endif + ); /* Make sure to clear use_pv_cr3 and xen_cr3 before pv_cr3. */ cpu_info->use_pv_cr3 = false; cpu_info->xen_cr3 = 0; diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm/shadow/common.c +++ b/xen/arch/x86/mm/shadow/common.c @@ -XXX,XX +XXX,XX @@ static void sh_update_paging_modes(struct vcpu *v) &SHADOW_INTERNAL_NAME(sh_paging_mode, 2); } - if ( pagetable_is_null(v->arch.hvm.monitor_table) ) + if ( mfn_eq(v->arch.hvm.shadow_linear_l3, INVALID_MFN) ) { - mfn_t mmfn = sh_make_monitor_table( - v, v->arch.paging.mode->shadow.shadow_levels); - - if ( mfn_eq(mmfn, INVALID_MFN) ) + if ( sh_update_monitor_table( + v, v->arch.paging.mode->shadow.shadow_levels) ) return; - v->arch.hvm.monitor_table = pagetable_from_mfn(mmfn); - make_cr3(v, mmfn); hvm_update_host_cr3(v); } @@ -XXX,XX +XXX,XX @@ static void sh_update_paging_modes(struct vcpu *v) (v->arch.paging.mode->shadow.shadow_levels != old_mode->shadow.shadow_levels) ) { - /* Need to make a new monitor table for the new mode */ - mfn_t new_mfn, old_mfn; + /* Might need to make a new L3 linear table for the new mode */ + mfn_t old_mfn; if ( v != current && vcpu_runnable(v) ) { @@ -XXX,XX +XXX,XX @@ static void sh_update_paging_modes(struct vcpu *v) return; } - old_mfn = pagetable_get_mfn(v->arch.hvm.monitor_table); - v->arch.hvm.monitor_table = pagetable_null(); - new_mfn = sh_make_monitor_table( - v, v->arch.paging.mode->shadow.shadow_levels); - if ( mfn_eq(new_mfn, INVALID_MFN) ) + old_mfn = v->arch.hvm.shadow_linear_l3; + v->arch.hvm.shadow_linear_l3 = INVALID_MFN; + if ( sh_update_monitor_table( + v, v->arch.paging.mode->shadow.shadow_levels) ) { sh_destroy_monitor_table(v, old_mfn, old_mode->shadow.shadow_levels); return; } - v->arch.hvm.monitor_table = pagetable_from_mfn(new_mfn); - SHADOW_PRINTK("new monitor table %"PRI_mfn "\n", - mfn_x(new_mfn)); + SHADOW_PRINTK("new L3 linear table %"PRI_mfn "\n", + mfn_x(v->arch.hvm.shadow_linear_l3)); /* Don't be running on the old monitor table when we * pull it down! Switch CR3, and warn the HVM code that * its host cr3 has changed. 
*/ - make_cr3(v, new_mfn); if ( v == current ) write_ptbase(v); hvm_update_host_cr3(v); @@ -XXX,XX +XXX,XX @@ void shadow_vcpu_teardown(struct vcpu *v) sh_detach_old_tables(v); #ifdef CONFIG_HVM - if ( shadow_mode_external(d) ) + if ( shadow_mode_external(d) && + !mfn_eq(v->arch.hvm.shadow_linear_l3, INVALID_MFN) ) { - mfn_t mfn = pagetable_get_mfn(v->arch.hvm.monitor_table); - - if ( mfn_x(mfn) ) - sh_destroy_monitor_table( - v, mfn, + sh_destroy_monitor_table( + v, v->arch.hvm.shadow_linear_l3, v->arch.paging.mode->shadow.shadow_levels); - - v->arch.hvm.monitor_table = pagetable_null(); + v->arch.hvm.shadow_linear_l3 = INVALID_MFN; } #endif diff --git a/xen/arch/x86/mm/shadow/hvm.c b/xen/arch/x86/mm/shadow/hvm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm/shadow/hvm.c +++ b/xen/arch/x86/mm/shadow/hvm.c @@ -XXX,XX +XXX,XX @@ bool cf_check shadow_flush_tlb(const unsigned long *vcpu_bitmap) return true; } -mfn_t sh_make_monitor_table(const struct vcpu *v, unsigned int shadow_levels) +int sh_update_monitor_table(struct vcpu *v, unsigned int shadow_levels) { struct domain *d = v->domain; - mfn_t m4mfn; - l4_pgentry_t *l4e; - ASSERT(!pagetable_get_pfn(v->arch.hvm.monitor_table)); + ASSERT(mfn_eq(v->arch.hvm.shadow_linear_l3, INVALID_MFN)); /* Guarantee we can get the memory we need */ - if ( !shadow_prealloc(d, SH_type_monitor_table, CONFIG_PAGING_LEVELS) ) - return INVALID_MFN; - - m4mfn = shadow_alloc(d, SH_type_monitor_table, 0); - mfn_to_page(m4mfn)->shadow_flags = 4; - - l4e = map_domain_page(m4mfn); - - /* - * Create a self-linear mapping, but no shadow-linear mapping. A - * shadow-linear mapping will either be inserted below when creating - * lower level monitor tables, or later in sh_update_cr3(). - */ - init_xen_l4_slots(l4e, m4mfn, INVALID_MFN, d->arch.perdomain_l3_pg, - false, true, false); + if ( !shadow_prealloc(d, SH_type_monitor_table, CONFIG_PAGING_LEVELS - 1) ) + return -ENOMEM; if ( shadow_levels < 4 ) { @@ -XXX,XX +XXX,XX @@ mfn_t sh_make_monitor_table(const struct vcpu *v, unsigned int shadow_levels) */ m3mfn = shadow_alloc(d, SH_type_monitor_table, 0); mfn_to_page(m3mfn)->shadow_flags = 3; - l4e[l4_table_offset(SH_LINEAR_PT_VIRT_START)] - = l4e_from_mfn(m3mfn, __PAGE_HYPERVISOR_RW); m2mfn = shadow_alloc(d, SH_type_monitor_table, 0); mfn_to_page(m2mfn)->shadow_flags = 2; l3e = map_domain_page(m3mfn); l3e[0] = l3e_from_mfn(m2mfn, __PAGE_HYPERVISOR_RW); unmap_domain_page(l3e); - } - unmap_domain_page(l4e); + v->arch.hvm.shadow_linear_l3 = m3mfn; + + /* + * If the vCPU is not the current one the L4 entry will be updated on + * context switch. + */ + if ( v == current ) + this_cpu(monitor_pgt)[l4_table_offset(SH_LINEAR_PT_VIRT_START)] + = l4e_from_mfn(m3mfn, __PAGE_HYPERVISOR_RW); + } + else if ( v == current ) + /* The shadow linear mapping will be inserted in sh_update_cr3(). 
*/ + this_cpu(monitor_pgt)[l4_table_offset(SH_LINEAR_PT_VIRT_START)] + = l4e_empty(); - return m4mfn; + return 0; } -void sh_destroy_monitor_table(const struct vcpu *v, mfn_t mmfn, +void sh_destroy_monitor_table(const struct vcpu *v, mfn_t m3mfn, unsigned int shadow_levels) { struct domain *d = v->domain; - ASSERT(mfn_to_page(mmfn)->u.sh.type == SH_type_monitor_table); - if ( shadow_levels < 4 ) { - mfn_t m3mfn; - l4_pgentry_t *l4e = map_domain_page(mmfn); - l3_pgentry_t *l3e; - unsigned int linear_slot = l4_table_offset(SH_LINEAR_PT_VIRT_START); + l3_pgentry_t *l3e = map_domain_page(m3mfn); + + ASSERT(!mfn_eq(m3mfn, INVALID_MFN)); + ASSERT(mfn_to_page(m3mfn)->u.sh.type == SH_type_monitor_table); /* * Need to destroy the l3 and l2 monitor pages used * for the linear map. */ - ASSERT(l4e_get_flags(l4e[linear_slot]) & _PAGE_PRESENT); - m3mfn = l4e_get_mfn(l4e[linear_slot]); - l3e = map_domain_page(m3mfn); ASSERT(l3e_get_flags(l3e[0]) & _PAGE_PRESENT); shadow_free(d, l3e_get_mfn(l3e[0])); unmap_domain_page(l3e); shadow_free(d, m3mfn); - - unmap_domain_page(l4e); } - - /* Put the memory back in the pool */ - shadow_free(d, mmfn); + else + ASSERT(mfn_eq(m3mfn, INVALID_MFN)); } /**************************************************************************/ diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm/shadow/multi.c +++ b/xen/arch/x86/mm/shadow/multi.c @@ -XXX,XX +XXX,XX @@ static unsigned long cf_check sh_gva_to_gfn( return gfn_x(gfn); } +static void cf_check set_cpu_monitor_table(struct vcpu *v) +{ + root_pgentry_t *pgt = this_cpu(monitor_pgt); + + virt_to_page(pgt)->shadow_flags = 4; + + /* Setup linear L3 entry. */ + if ( !mfn_eq(v->arch.hvm.shadow_linear_l3, INVALID_MFN) ) + pgt[l4_table_offset(SH_LINEAR_PT_VIRT_START)] = + l4e_from_mfn(v->arch.hvm.shadow_linear_l3, __PAGE_HYPERVISOR_RW); + else + pgt[l4_table_offset(SH_LINEAR_PT_VIRT_START)] = + l4e_from_pfn( + pagetable_get_pfn(v->arch.paging.shadow.shadow_table[0]), + __PAGE_HYPERVISOR_RW); +} + +static void cf_check clear_cpu_monitor_table(struct vcpu *v) +{ + root_pgentry_t *pgt = this_cpu(monitor_pgt); + + virt_to_page(pgt)->shadow_flags = 0; + + pgt[l4_table_offset(SH_LINEAR_PT_VIRT_START)] = l4e_empty(); +} + #endif /* CONFIG_HVM */ static inline void @@ -XXX,XX +XXX,XX @@ sh_update_linear_entries(struct vcpu *v) */ /* Don't try to update the monitor table if it doesn't exist */ - if ( !shadow_mode_external(d) || - pagetable_get_pfn(v->arch.hvm.monitor_table) == 0 ) + if ( !shadow_mode_external(d) +#if SHADOW_PAGING_LEVELS == 3 + || mfn_eq(v->arch.hvm.shadow_linear_l3, INVALID_MFN) +#endif + ) return; #if !defined(CONFIG_HVM) @@ -XXX,XX +XXX,XX @@ sh_update_linear_entries(struct vcpu *v) pagetable_get_pfn(v->arch.paging.shadow.shadow_table[0]), __PAGE_HYPERVISOR_RW); } - else - { - l4_pgentry_t *ml4e; - - ml4e = map_domain_page(pagetable_get_mfn(v->arch.hvm.monitor_table)); - ml4e[l4_table_offset(SH_LINEAR_PT_VIRT_START)] = - l4e_from_pfn( - pagetable_get_pfn(v->arch.paging.shadow.shadow_table[0]), - __PAGE_HYPERVISOR_RW); - unmap_domain_page(ml4e); - } #elif SHADOW_PAGING_LEVELS == 3 @@ -XXX,XX +XXX,XX @@ sh_update_linear_entries(struct vcpu *v) + l2_linear_offset(SH_LINEAR_PT_VIRT_START); else { - mfn_t l3mfn, l2mfn; - l4_pgentry_t *ml4e; - l3_pgentry_t *ml3e; - int linear_slot = shadow_l4_table_offset(SH_LINEAR_PT_VIRT_START); - ml4e = map_domain_page(pagetable_get_mfn(v->arch.hvm.monitor_table)); - - ASSERT(l4e_get_flags(ml4e[linear_slot]) & 
_PAGE_PRESENT); - l3mfn = l4e_get_mfn(ml4e[linear_slot]); - ml3e = map_domain_page(l3mfn); - unmap_domain_page(ml4e); + mfn_t l2mfn; + l3_pgentry_t *ml3e = map_domain_page(v->arch.hvm.shadow_linear_l3); ASSERT(l3e_get_flags(ml3e[0]) & _PAGE_PRESENT); l2mfn = l3e_get_mfn(ml3e[0]); @@ -XXX,XX +XXX,XX @@ static pagetable_t cf_check sh_update_cr3(struct vcpu *v, bool noflush) /// /// v->arch.cr3 /// - if ( shadow_mode_external(d) ) + if ( shadow_mode_external(d) && v == current ) { - make_cr3(v, pagetable_get_mfn(v->arch.hvm.monitor_table)); +#ifdef CONFIG_HVM + make_cr3(v, _mfn(virt_to_mfn(this_cpu(monitor_pgt)))); +#else + ASSERT_UNREACHABLE(); +#endif } #if SHADOW_PAGING_LEVELS == 4 else // not shadow_mode_external... @@ -XXX,XX +XXX,XX @@ const struct paging_mode sh_paging_mode = { .invlpg = sh_invlpg, #ifdef CONFIG_HVM .gva_to_gfn = sh_gva_to_gfn, + .set_cpu_monitor_table = set_cpu_monitor_table, + .clear_cpu_monitor_table = clear_cpu_monitor_table, #endif .update_cr3 = sh_update_cr3, .guest_levels = GUEST_PAGING_LEVELS, diff --git a/xen/arch/x86/mm/shadow/private.h b/xen/arch/x86/mm/shadow/private.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm/shadow/private.h +++ b/xen/arch/x86/mm/shadow/private.h @@ -XXX,XX +XXX,XX @@ void shadow_unhook_mappings(struct domain *d, mfn_t smfn, int user_only); * sh_{make,destroy}_monitor_table() depend only on the number of shadow * levels. */ -mfn_t sh_make_monitor_table(const struct vcpu *v, unsigned int shadow_levels); -void sh_destroy_monitor_table(const struct vcpu *v, mfn_t mmfn, +int sh_update_monitor_table(struct vcpu *v, unsigned int shadow_levels); +void sh_destroy_monitor_table(const struct vcpu *v, mfn_t m3mfn, unsigned int shadow_levels); /* VRAM dirty tracking helpers. */ -- 2.45.2
Introduce support for possibly using a different L4 across the idle vCPUs. This change only introduces support for loading a per-pCPU idle L4, but even with the per-CPU idle page-table enabled it should still be a clone of idle_pg_table, hence no functional change expected. Note the idle L4 is not changed after Xen has reached the SYS_STATE_smp_boot state, hence there is no need to synchronize the contents of the L4 once the CPUs are started. Using a per-CPU idle page-table is not strictly required for the Address Space Isolation work, as idle page tables are never used when running guests. However it simplifies memory management of the per-CPU mappings, as creating per-CPU mappings only requires using the idle page-table of the CPU where the mappings should be created. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/boot/x86_64.S | 11 +++++++++++ xen/arch/x86/domain.c | 20 +++++++++++++++++++- xen/arch/x86/domain_page.c | 2 +- xen/arch/x86/include/asm/setup.h | 1 + xen/arch/x86/setup.c | 3 +++ xen/arch/x86/smpboot.c | 7 +++++++ 6 files changed, 42 insertions(+), 2 deletions(-) diff --git a/xen/arch/x86/boot/x86_64.S b/xen/arch/x86/boot/x86_64.S index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/boot/x86_64.S +++ b/xen/arch/x86/boot/x86_64.S @@ -XXX,XX +XXX,XX @@ ENTRY(__high_start) mov $XEN_MINIMAL_CR4,%rcx mov %rcx,%cr4 + /* + * Possibly switch to the per-CPU idle page-tables. Note we cannot + * switch earlier as the per-CPU page-tables might be above 4G, and + * hence need to load them from 64bit code. + */ + mov ap_cr3(%rip), %rax + test %rax, %rax + jz .L_skip_cr3 + mov %rax, %cr3 +.L_skip_cr3: + mov stack_start(%rip),%rsp /* Reset EFLAGS (subsumes CLI and CLD). */ diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -XXX,XX +XXX,XX @@ void arch_vcpu_regs_init(struct vcpu *v) int arch_vcpu_create(struct vcpu *v) { struct domain *d = v->domain; + root_pgentry_t *pgt = NULL; int rc; v->arch.flags = TF_kernel_mode; @@ -XXX,XX +XXX,XX @@ int arch_vcpu_create(struct vcpu *v) else { /* Idle domain */ - v->arch.cr3 = __pa(idle_pg_table); + if ( (opt_asi_pv || opt_asi_hvm) && v->vcpu_id ) + { + pgt = alloc_xenheap_page(); + + /* + * For the idle vCPU 0 (the BSP idle vCPU) use idle_pg_table + * directly, there's no need to create yet another copy. + */ + rc = -ENOMEM; + if ( !pgt ) + goto fail; + + copy_page(pgt, idle_pg_table); + v->arch.cr3 = __pa(pgt); + } + else + v->arch.cr3 = __pa(idle_pg_table); rc = 0; v->arch.msrs = ZERO_BLOCK_PTR; /* Catch stray misuses */ } @@ -XXX,XX +XXX,XX @@ int arch_vcpu_create(struct vcpu *v) vcpu_destroy_fpu(v); xfree(v->arch.msrs); v->arch.msrs = NULL; + free_xenheap_page(pgt); return rc; } diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/domain_page.c +++ b/xen/arch/x86/domain_page.c @@ -XXX,XX +XXX,XX @@ static inline struct vcpu *mapcache_current_vcpu(void) if ( (v = idle_vcpu[smp_processor_id()]) == current ) sync_local_execstate(); /* We must now be running on the idle page table. 
*/ - ASSERT(cr3_pa(read_cr3()) == __pa(idle_pg_table)); + ASSERT(cr3_pa(read_cr3()) == cr3_pa(v->arch.cr3)); } return v; diff --git a/xen/arch/x86/include/asm/setup.h b/xen/arch/x86/include/asm/setup.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/setup.h +++ b/xen/arch/x86/include/asm/setup.h @@ -XXX,XX +XXX,XX @@ extern unsigned long xenheap_initial_phys_start; extern uint64_t boot_tsc_stamp; extern void *stack_start; +extern unsigned long ap_cr3; void early_cpu_init(bool verbose); void early_time_init(void); diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/setup.c +++ b/xen/arch/x86/setup.c @@ -XXX,XX +XXX,XX @@ char asmlinkage __section(".init.bss.stack_aligned") __aligned(STACK_SIZE) /* Used by the BSP/AP paths to find the higher half stack mapping to use. */ void *stack_start = cpu0_stack + STACK_SIZE - sizeof(struct cpu_info); +/* cr3 value for the AP to load on boot. */ +unsigned long ap_cr3; + /* Used by the boot asm to stash the relocated multiboot info pointer. */ unsigned int asmlinkage __initdata multiboot_ptr; diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/smpboot.c +++ b/xen/arch/x86/smpboot.c @@ -XXX,XX +XXX,XX @@ static int do_boot_cpu(int apicid, int cpu) stack_start = stack_base[cpu] + STACK_SIZE - sizeof(struct cpu_info); + /* + * If per-CPU idle root page table has been allocated, switch to it as + * part of the AP bringup trampoline. + */ + ap_cr3 = idle_vcpu[cpu]->arch.cr3 != __pa(idle_pg_table) ? + idle_vcpu[cpu]->arch.cr3 : 0; + /* This grunge runs the startup process for the targeted processor. */ set_cpu_state(CPU_STATE_INIT); -- 2.45.2
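As a side note on how later patches consume this: once every idle vCPU carries its own cr3, the root page-table a given pCPU uses while idle can be looked up through idle_vcpu[]. A minimal sketch, with a hypothetical helper name (the later page-table patches open-code the same expression inside virt_to_xen_l3e()):

    /* Hypothetical helper: the idle root page-table in use by @cpu. */
    static root_pgentry_t *idle_root_pgt(unsigned int cpu)
    {
        /* Before idle_vcpu[] is populated fall back to the static idle L4. */
        return idle_vcpu[cpu] ? maddr_to_virt(idle_vcpu[cpu]->arch.cr3)
                              : idle_pg_table;
    }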
So far L4 slot 260 has always been per-domain, in other words: all vCPUs of a domain share the same L3 entry. Currently only 3 slots are used in that L3 table, which leaves plenty of room. Introduce a per-CPU L3 that's used when the domain has Address Space Isolation enabled. Such per-CPU L3 is currently populated using the same L3 entries present on the per-domain L3 (d->arch.perdomain_l3_pg). No functional change expected, as the per-CPU L3 is always a copy of the contents of d->arch.perdomain_l3_pg. Note that all the per-domain L3 entries are populated at domain creation, and hence there's no need to sync the state of the per-CPU L3 as the domain won't yet be running when the L3 is modified. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/include/asm/domain.h | 2 + xen/arch/x86/include/asm/mm.h | 4 ++ xen/arch/x86/mm.c | 80 +++++++++++++++++++++++++++++-- xen/arch/x86/setup.c | 8 ++++ xen/arch/x86/smpboot.c | 4 ++ 5 files changed, 95 insertions(+), 3 deletions(-) diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/domain.h +++ b/xen/arch/x86/include/asm/domain.h @@ -XXX,XX +XXX,XX @@ struct arch_domain { struct page_info *perdomain_l3_pg; + struct page_info *perdomain_l2_pgs[PERDOMAIN_SLOTS]; + #ifdef CONFIG_PV32 unsigned int hv_compat_vstart; #endif diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/mm.h +++ b/xen/arch/x86/include/asm/mm.h @@ -XXX,XX +XXX,XX @@ static inline bool arch_mfns_in_directmap(unsigned long mfn, unsigned long nr) /* Setup the per-domain slot in the root page table pointer. */ void setup_perdomain_slot(const struct vcpu *v, root_pgentry_t *root_pgt); +/* Allocate a per-CPU local L3 table to use in the per-domain slot. */ +int allocate_perdomain_local_l3(unsigned int cpu); +void free_perdomain_local_l3(unsigned int cpu); + #endif /* __ASM_X86_MM_H__ */ diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ int create_perdomain_mapping(struct domain *d, unsigned long va, l2tab = __map_domain_page(pg); clear_page(l2tab); l3tab[l3_table_offset(va)] = l3e_from_page(pg, __PAGE_HYPERVISOR_RW); + /* + * Keep a reference to the per-domain L3 entries in case a per-CPU L3 + * is in use (as opposed to using perdomain_l3_pg). + */ + ASSERT(!d->creation_finished); + d->arch.perdomain_l2_pgs[l3_table_offset(va)] = pg; } else l2tab = map_l2t_from_l3e(l3tab[l3_table_offset(va)]); @@ -XXX,XX +XXX,XX @@ unsigned long get_upper_mfn_bound(void) return min(max_mfn, 1UL << (paddr_bits - PAGE_SHIFT)) - 1; } +static DEFINE_PER_CPU(l3_pgentry_t *, local_l3); + +static void populate_perdomain(const struct domain *d, l4_pgentry_t *l4, + l3_pgentry_t *l3) +{ + unsigned int i; + + /* Populate the per-CPU L3 with the per-domain entries. */ + for ( i = 0; i < ARRAY_SIZE(d->arch.perdomain_l2_pgs); i++ ) + { + const struct page_info *pg = d->arch.perdomain_l2_pgs[i]; + + BUILD_BUG_ON(ARRAY_SIZE(d->arch.perdomain_l2_pgs) > + L3_PAGETABLE_ENTRIES); + l3e_write(&l3[i], pg ? 
l3e_from_page(pg, __PAGE_HYPERVISOR_RW) + : l3e_empty()); + } + + l4e_write(&l4[l4_table_offset(PERDOMAIN_VIRT_START)], + l4e_from_mfn(virt_to_mfn(l3), __PAGE_HYPERVISOR_RW)); +} + +int allocate_perdomain_local_l3(unsigned int cpu) +{ + const struct domain *d = idle_vcpu[cpu]->domain; + l3_pgentry_t *l3; + root_pgentry_t *root_pgt = maddr_to_virt(idle_vcpu[cpu]->arch.cr3); + + ASSERT(!per_cpu(local_l3, cpu)); + + if ( !opt_asi_pv && !opt_asi_hvm ) + return 0; + + l3 = alloc_xenheap_page(); + if ( !l3 ) + return -ENOMEM; + + clear_page(l3); + + /* Setup the idle domain slots (current domain) in the L3. */ + populate_perdomain(d, root_pgt, l3); + + per_cpu(local_l3, cpu) = l3; + + return 0; +} + +void free_perdomain_local_l3(unsigned int cpu) +{ + l3_pgentry_t *l3 = per_cpu(local_l3, cpu); + + if ( !l3 ) + return; + + per_cpu(local_l3, cpu) = NULL; + free_xenheap_page(l3); +} + void setup_perdomain_slot(const struct vcpu *v, root_pgentry_t *root_pgt) { - l4e_write(&root_pgt[root_table_offset(PERDOMAIN_VIRT_START)], - l4e_from_page(v->domain->arch.perdomain_l3_pg, - __PAGE_HYPERVISOR_RW)); + const struct domain *d = v->domain; + + if ( d->arch.asi ) + { + l3_pgentry_t *l3 = this_cpu(local_l3); + + ASSERT(l3); + populate_perdomain(d, root_pgt, l3); + } + else if ( is_hvm_domain(d) || d->arch.pv.xpti ) + l4e_write(&root_pgt[root_table_offset(PERDOMAIN_VIRT_START)], + l4e_from_page(v->domain->arch.perdomain_l3_pg, + __PAGE_HYPERVISOR_RW)); if ( !is_pv_64bit_vcpu(v) ) /* diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/setup.c +++ b/xen/arch/x86/setup.c @@ -XXX,XX +XXX,XX @@ void asmlinkage __init noreturn __start_xen(unsigned long mbi_p) alternative_branches(); + /* + * Setup the local per-domain L3 for the BSP also, so it matches the state + * of the APs. + */ + ret = allocate_perdomain_local_l3(0); + if ( ret ) + panic("Error %d setting up local per-domain L3\n", ret); + /* * NB: when running as a PV shim VCPUOP_up/down is wired to the shim * physical cpu_add/remove functions, so launch the guest with only diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/smpboot.c +++ b/xen/arch/x86/smpboot.c @@ -XXX,XX +XXX,XX @@ static void cpu_smpboot_free(unsigned int cpu, bool remove) } cleanup_cpu_root_pgt(cpu); + free_perdomain_local_l3(cpu); if ( per_cpu(stubs.addr, cpu) ) { @@ -XXX,XX +XXX,XX @@ static int cpu_smpboot_alloc(unsigned int cpu) per_cpu(stubs.addr, cpu) = stub_page + STUB_BUF_CPU_OFFS(cpu); rc = setup_cpu_root_pgt(cpu); + if ( rc ) + goto out; + rc = allocate_perdomain_local_l3(cpu); if ( rc ) goto out; rc = -ENOMEM; -- 2.45.2
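To make the two configurations of L4 slot 260 easier to compare, below is a condensed, illustrative restatement of the setup_perdomain_slot() hunk above; it is not new functionality, just the same logic with the assertions stripped and a made-up function name:

    static void example_perdomain_slot(const struct vcpu *v,
                                       root_pgentry_t *root_pgt)
    {
        const struct domain *d = v->domain;

        if ( d->arch.asi )
            /* ASI: refresh the per-CPU L3 with this domain's L2 pages. */
            populate_perdomain(d, root_pgt, this_cpu(local_l3));
        else if ( is_hvm_domain(d) || d->arch.pv.xpti )
            /* No ASI: link the shared per-domain L3 straight from the root. */
            l4e_write(&root_pgt[root_table_offset(PERDOMAIN_VIRT_START)],
                      l4e_from_page(d->arch.perdomain_l3_pg,
                                    __PAGE_HYPERVISOR_RW));
    }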
Add logic in map_pages_to_xen() and modify_xen_mappings() so that TLB flushes are only performed locally when dealing with entries in the per-CPU area of the page-tables. No functional change intended, as there are no callers added that create or modify per-CPU mappings, nor is the per-CPU area properly set up in the page-tables yet. Note that the removed flush_area() ended up calling flush_area_mask() through the flush_area_all() alias. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/include/asm/config.h | 4 ++ xen/arch/x86/include/asm/flushtlb.h | 1 - xen/arch/x86/mm.c | 64 +++++++++++++++++++---------- 3 files changed, 47 insertions(+), 22 deletions(-) diff --git a/xen/arch/x86/include/asm/config.h b/xen/arch/x86/include/asm/config.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/config.h +++ b/xen/arch/x86/include/asm/config.h @@ -XXX,XX +XXX,XX @@ extern unsigned char boot_edid_info[128]; #define PERDOMAIN_SLOTS 3 #define PERDOMAIN_VIRT_SLOT(s) (PERDOMAIN_VIRT_START + (s) * \ (PERDOMAIN_SLOT_MBYTES << 20)) +#define PERCPU_VIRT_START PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS) +#define PERCPU_SLOTS 1 +#define PERCPU_VIRT_SLOT(s) (PERCPU_VIRT_START + (s) * \ + (PERDOMAIN_SLOT_MBYTES << 20)) /* Slot 4: mirror of per-domain mappings (for compat xlat area accesses). */ #define PERDOMAIN_ALT_VIRT_START PML4_ADDR(4) /* Slot 261: machine-to-phys conversion table (256GB). */ diff --git a/xen/arch/x86/include/asm/flushtlb.h b/xen/arch/x86/include/asm/flushtlb.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/flushtlb.h +++ b/xen/arch/x86/include/asm/flushtlb.h @@ -XXX,XX +XXX,XX @@ void flush_area_mask(const cpumask_t *mask, const void *va, #define flush_mask(mask, flags) flush_area_mask(mask, NULL, flags) /* Flush all CPUs' TLBs/caches */ -#define flush_area_all(va, flags) flush_area_mask(&cpu_online_map, va, flags) #define flush_all(flags) flush_mask(&cpu_online_map, flags) /* Flush local TLBs */ diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ static DEFINE_SPINLOCK(map_pgdir_lock); */ static l3_pgentry_t *virt_to_xen_l3e(unsigned long v) { + unsigned int cpu = smp_processor_id(); + /* Called before idle_vcpu is populated, fallback to idle_pg_table. */ + root_pgentry_t *root_pgt = idle_vcpu[cpu] ? + maddr_to_virt(idle_vcpu[cpu]->arch.cr3) : idle_pg_table; l4_pgentry_t *pl4e; - pl4e = &idle_pg_table[l4_table_offset(v)]; + pl4e = &root_pgt[l4_table_offset(v)]; if ( !(l4e_get_flags(*pl4e) & _PAGE_PRESENT) ) { bool locking = system_state > SYS_STATE_boot; @@ -XXX,XX +XXX,XX @@ static l1_pgentry_t *virt_to_xen_l1e(unsigned long v) #define l1f_to_lNf(f) (((f) & _PAGE_PRESENT) ? ((f) | _PAGE_PSE) : (f)) #define lNf_to_l1f(f) (((f) & _PAGE_PRESENT) ? ((f) & ~_PAGE_PSE) : (f)) -/* flush_area_all() can be used prior to any other CPU being online. */ -#define flush_area(v, f) flush_area_all((const void *)(v), f) +/* flush_area_mask() can be used prior to any other CPU being online. */ +#define flush_area_mask(m, v, f) flush_area_mask(m, (const void *)(v), f) #define L3T_INIT(page) (page) = ZERO_BLOCK_PTR @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( unsigned long nr_mfns, unsigned int flags) { - bool locking = system_state > SYS_STATE_boot; + bool global = virt < PERCPU_VIRT_START || + virt >= PERCPU_VIRT_SLOT(PERCPU_SLOTS); + bool locking = system_state > SYS_STATE_boot && global; + const cpumask_t *flush_mask = global ? 
&cpu_online_map + : cpumask_of(smp_processor_id()); l3_pgentry_t *pl3e = NULL, ol3e; l2_pgentry_t *pl2e = NULL, ol2e; l1_pgentry_t *pl1e, ol1e; @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( } \ } while (0) + /* Ensure it's a global mapping or it's only modifying the per-CPU area. */ + ASSERT(global || + (virt + nr_mfns * PAGE_SIZE >= PERCPU_VIRT_START && + virt + nr_mfns * PAGE_SIZE < PERCPU_VIRT_SLOT(PERCPU_SLOTS))); + L3T_INIT(current_l3page); while ( nr_mfns != 0 ) @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( if ( l3e_get_flags(ol3e) & _PAGE_PSE ) { flush_flags(lNf_to_l1f(l3e_get_flags(ol3e))); - flush_area(virt, flush_flags); + flush_area_mask(flush_mask, virt, flush_flags); } else { @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( unmap_domain_page(l1t); } } - flush_area(virt, flush_flags); + flush_area_mask(flush_mask, virt, flush_flags); for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ ) { ol2e = l2t[i]; @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( } if ( locking ) spin_unlock(&map_pgdir_lock); - flush_area(virt, flush_flags); + flush_area_mask(flush_mask, virt, flush_flags); free_xen_pagetable(l2mfn); } @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( if ( l2e_get_flags(ol2e) & _PAGE_PSE ) { flush_flags(lNf_to_l1f(l2e_get_flags(ol2e))); - flush_area(virt, flush_flags); + flush_area_mask(flush_mask, virt, flush_flags); } else { @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++ ) flush_flags(l1e_get_flags(l1t[i])); - flush_area(virt, flush_flags); + flush_area_mask(flush_mask, virt, flush_flags); unmap_domain_page(l1t); free_xen_pagetable(l2e_get_mfn(ol2e)); } @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( } if ( locking ) spin_unlock(&map_pgdir_lock); - flush_area(virt, flush_flags); + flush_area_mask(flush_mask, virt, flush_flags); free_xen_pagetable(l1mfn); } @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( unsigned int flush_flags = FLUSH_TLB | FLUSH_ORDER(0); flush_flags(l1e_get_flags(ol1e)); - flush_area(virt, flush_flags); + flush_area_mask(flush_mask, virt, flush_flags); } virt += 1UL << L1_PAGETABLE_SHIFT; @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( l2e_write(pl2e, l2e_from_pfn(base_mfn, l1f_to_lNf(flags))); if ( locking ) spin_unlock(&map_pgdir_lock); - flush_area(virt - PAGE_SIZE, - FLUSH_TLB_GLOBAL | - FLUSH_ORDER(PAGETABLE_ORDER)); + flush_area_mask(flush_mask, virt - PAGE_SIZE, + FLUSH_TLB_GLOBAL | + FLUSH_ORDER(PAGETABLE_ORDER)); free_xen_pagetable(l2e_get_mfn(ol2e)); } else if ( locking ) @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( l3e_write(pl3e, l3e_from_pfn(base_mfn, l1f_to_lNf(flags))); if ( locking ) spin_unlock(&map_pgdir_lock); - flush_area(virt - PAGE_SIZE, - FLUSH_TLB_GLOBAL | - FLUSH_ORDER(2*PAGETABLE_ORDER)); + flush_area_mask(flush_mask, virt - PAGE_SIZE, + FLUSH_TLB_GLOBAL | + FLUSH_ORDER(2*PAGETABLE_ORDER)); free_xen_pagetable(l3e_get_mfn(ol3e)); } else if ( locking ) @@ -XXX,XX +XXX,XX @@ int __init populate_pt_range(unsigned long virt, unsigned long nr_mfns) */ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) { - bool locking = system_state > SYS_STATE_boot; + bool global = s < PERCPU_VIRT_START || + s >= PERCPU_VIRT_SLOT(PERCPU_SLOTS); + bool locking = system_state > SYS_STATE_boot && global; + const cpumask_t *flush_mask = global ? 
&cpu_online_map + : cpumask_of(smp_processor_id()); l3_pgentry_t *pl3e = NULL; l2_pgentry_t *pl2e = NULL; l1_pgentry_t *pl1e; @@ -XXX,XX +XXX,XX @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) int rc = -ENOMEM; struct page_info *current_l3page; + ASSERT(global || + (e >= PERCPU_VIRT_START && e < PERCPU_VIRT_SLOT(PERCPU_SLOTS))); + /* Set of valid PTE bits which may be altered. */ #define FLAGS_MASK (_PAGE_NX|_PAGE_DIRTY|_PAGE_ACCESSED|_PAGE_RW|_PAGE_PRESENT) nf &= FLAGS_MASK; @@ -XXX,XX +XXX,XX @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) l2e_write(pl2e, l2e_empty()); if ( locking ) spin_unlock(&map_pgdir_lock); - flush_area(NULL, FLUSH_TLB_GLOBAL); /* flush before free */ + /* flush before free */ + flush_area_mask(flush_mask, NULL, FLUSH_TLB_GLOBAL); free_xen_pagetable(l1mfn); } else if ( locking ) @@ -XXX,XX +XXX,XX @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) l3e_write(pl3e, l3e_empty()); if ( locking ) spin_unlock(&map_pgdir_lock); - flush_area(NULL, FLUSH_TLB_GLOBAL); /* flush before free */ + /* flush before free */ + flush_area_mask(flush_mask, NULL, FLUSH_TLB_GLOBAL); free_xen_pagetable(l2mfn); } else if ( locking ) @@ -XXX,XX +XXX,XX @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) } } - flush_area(NULL, FLUSH_TLB_GLOBAL); + flush_area_mask(flush_mask, NULL, FLUSH_TLB_GLOBAL); #undef FLAGS_MASK rc = 0; -- 2.45.2
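The net effect of the hunks above is that the flush scope (and whether map_pgdir_lock is taken) is now derived from the linear address range being modified. An illustrative sketch of that classification, using the PERCPU_* constants introduced in config.h; the function name is made up:

    /*
     * Addresses inside the per-CPU slot only ever exist in the local CPU's
     * page-tables, so changes to them need no remote TLB shootdown.
     */
    static const cpumask_t *example_flush_mask(unsigned long virt)
    {
        bool global = virt < PERCPU_VIRT_START ||
                      virt >= PERCPU_VIRT_SLOT(PERCPU_SLOTS);

        return global ? &cpu_online_map : cpumask_of(smp_processor_id());
    }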
Add support for modifying the per-CPU page-table entries of remote CPUs; this will be required in order to set up the page-tables of CPUs before bringing them up. A restriction is added so that remote page-tables can only be modified as long as the remote CPU is not yet online. No functional change, as there's no user introduced that modifies remote page-tables. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- Can be merged with previous patch? --- xen/arch/x86/include/asm/mm.h | 15 ++++++++++ xen/arch/x86/mm.c | 55 ++++++++++++++++++++++++++--------- 2 files changed, 56 insertions(+), 14 deletions(-) diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/mm.h +++ b/xen/arch/x86/include/asm/mm.h @@ -XXX,XX +XXX,XX @@ void setup_perdomain_slot(const struct vcpu *v, root_pgentry_t *root_pgt); int allocate_perdomain_local_l3(unsigned int cpu); void free_perdomain_local_l3(unsigned int cpu); +/* Specify the CPU idle root page-table to use for modifications. */ +int map_pages_to_xen_cpu( + unsigned long virt, + mfn_t mfn, + unsigned long nr_mfns, + unsigned int flags, + unsigned int cpu); +int modify_xen_mappings_cpu(unsigned long s, unsigned long e, unsigned int nf, + unsigned int cpu); +static inline int destroy_xen_mappings_cpu(unsigned long s, unsigned long e, + unsigned int cpu) +{ + return modify_xen_mappings_cpu(s, e, _PAGE_NONE, cpu); +} + #endif /* __ASM_X86_MM_H__ */ diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ static DEFINE_SPINLOCK(map_pgdir_lock); * For virt_to_xen_lXe() functions, they take a linear address and return a * pointer to Xen's LX entry. Caller needs to unmap the pointer. */ -static l3_pgentry_t *virt_to_xen_l3e(unsigned long v) +static l3_pgentry_t *virt_to_xen_l3e_cpu(unsigned long v, unsigned int cpu) { - unsigned int cpu = smp_processor_id(); /* Called before idle_vcpu is populated, fallback to idle_pg_table. */ root_pgentry_t *root_pgt = idle_vcpu[cpu] ? maddr_to_virt(idle_vcpu[cpu]->arch.cr3) : idle_pg_table; @@ -XXX,XX +XXX,XX @@ static l3_pgentry_t *virt_to_xen_l3e(unsigned long v) return map_l3t_from_l4e(*pl4e) + l3_table_offset(v); } -static l2_pgentry_t *virt_to_xen_l2e(unsigned long v) +static l3_pgentry_t *virt_to_xen_l3e(unsigned long v) +{ + return virt_to_xen_l3e_cpu(v, smp_processor_id()); +} + +static l2_pgentry_t *virt_to_xen_l2e_cpu(unsigned long v, unsigned int cpu) { l3_pgentry_t *pl3e, l3e; - pl3e = virt_to_xen_l3e(v); + pl3e = virt_to_xen_l3e_cpu(v, cpu); if ( !pl3e ) return NULL; @@ -XXX,XX +XXX,XX @@ static l2_pgentry_t *virt_to_xen_l2e(unsigned long v) return map_l2t_from_l3e(l3e) + l2_table_offset(v); } -static l1_pgentry_t *virt_to_xen_l1e(unsigned long v) +static l1_pgentry_t *virt_to_xen_l1e_cpu(unsigned long v, unsigned int cpu) { l2_pgentry_t *pl2e, l2e; - pl2e = virt_to_xen_l2e(v); + pl2e = virt_to_xen_l2e_cpu(v, cpu); if ( !pl2e ) return NULL; @@ -XXX,XX +XXX,XX @@ mfn_t xen_map_to_mfn(unsigned long va) return ret; } -int map_pages_to_xen( +int map_pages_to_xen_cpu( unsigned long virt, mfn_t mfn, unsigned long nr_mfns, - unsigned int flags) + unsigned int flags, + unsigned int cpu) { bool global = virt < PERCPU_VIRT_START || virt >= PERCPU_VIRT_SLOT(PERCPU_SLOTS); bool locking = system_state > SYS_STATE_boot && global; const cpumask_t *flush_mask = global ? 
&cpu_online_map - : cpumask_of(smp_processor_id()); + : cpumask_of(cpu); l3_pgentry_t *pl3e = NULL, ol3e; l2_pgentry_t *pl2e = NULL, ol2e; l1_pgentry_t *pl1e, ol1e; @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( (virt + nr_mfns * PAGE_SIZE >= PERCPU_VIRT_START && virt + nr_mfns * PAGE_SIZE < PERCPU_VIRT_SLOT(PERCPU_SLOTS))); + /* Only allow modifying remote page-tables if the CPU is not online. */ + ASSERT(cpu == smp_processor_id() || !cpu_online(cpu)); + L3T_INIT(current_l3page); while ( nr_mfns != 0 ) @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( UNMAP_DOMAIN_PAGE(pl3e); UNMAP_DOMAIN_PAGE(pl2e); - pl3e = virt_to_xen_l3e(virt); + pl3e = virt_to_xen_l3e_cpu(virt, cpu); if ( !pl3e ) goto out; @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( free_xen_pagetable(l2mfn); } - pl2e = virt_to_xen_l2e(virt); + pl2e = virt_to_xen_l2e_cpu(virt, cpu); if ( !pl2e ) goto out; @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( /* Normal page mapping. */ if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) ) { - pl1e = virt_to_xen_l1e(virt); + pl1e = virt_to_xen_l1e_cpu(virt, cpu); if ( pl1e == NULL ) goto out; } @@ -XXX,XX +XXX,XX @@ int map_pages_to_xen( return rc; } +int map_pages_to_xen( + unsigned long virt, + mfn_t mfn, + unsigned long nr_mfns, + unsigned int flags) +{ + return map_pages_to_xen_cpu(virt, mfn, nr_mfns, flags, smp_processor_id()); +} + + int __init populate_pt_range(unsigned long virt, unsigned long nr_mfns) { return map_pages_to_xen(virt, INVALID_MFN, nr_mfns, MAP_SMALL_PAGES); @@ -XXX,XX +XXX,XX @@ int __init populate_pt_range(unsigned long virt, unsigned long nr_mfns) * * It is an error to call with present flags over an unpopulated range. */ -int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) +int modify_xen_mappings_cpu(unsigned long s, unsigned long e, unsigned int nf, + unsigned int cpu) { bool global = s < PERCPU_VIRT_START || s >= PERCPU_VIRT_SLOT(PERCPU_SLOTS); @@ -XXX,XX +XXX,XX @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) ASSERT(global || (e >= PERCPU_VIRT_START && e < PERCPU_VIRT_SLOT(PERCPU_SLOTS))); + /* Only allow modifying remote page-tables if the CPU is not online. */ + ASSERT(cpu == smp_processor_id() || !cpu_online(cpu)); + /* Set of valid PTE bits which may be altered. */ #define FLAGS_MASK (_PAGE_NX|_PAGE_DIRTY|_PAGE_ACCESSED|_PAGE_RW|_PAGE_PRESENT) nf &= FLAGS_MASK; @@ -XXX,XX +XXX,XX @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) UNMAP_DOMAIN_PAGE(pl2e); UNMAP_DOMAIN_PAGE(pl3e); - pl3e = virt_to_xen_l3e(v); + pl3e = virt_to_xen_l3e_cpu(v, cpu); if ( !pl3e ) goto out; @@ -XXX,XX +XXX,XX @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) #undef flush_area +int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) +{ + return modify_xen_mappings_cpu(s, e, nf, smp_processor_id()); +} + int destroy_xen_mappings(unsigned long s, unsigned long e) { return modify_xen_mappings(s, e, _PAGE_NONE); -- 2.45.2
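Since no caller is introduced here, the following is a hedged sketch of the intended usage pattern: preparing part of the per-CPU region in the page-tables of a CPU that has not been onlined yet (the next patch does essentially this from allocate_perdomain_local_l3()). The function name is made up for illustration and maps a single dummy entry:

    /* Pre-populate one page of the per-CPU area for a not-yet-online @cpu. */
    static int example_prepare_percpu_area(unsigned int cpu)
    {
        /* Remote modifications are only allowed while the CPU is offline. */
        ASSERT(cpu == smp_processor_id() || !cpu_online(cpu));

        return map_pages_to_xen_cpu(PERCPU_VIRT_START, INVALID_MFN, 1,
                                    MAP_SMALL_PAGES, cpu);
    }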
Introduce the logic to manage a per-CPU fixmap area. This includes adding a new set of helpers that are capable of creating mappings in the per-CPU page-table region by making use of map_pages_to_xen_cpu(). This per-CPU fixmap area is currently set to use one L3 slot: 1GiB of linear address space. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/include/asm/fixmap.h | 44 +++++++++++++++++++++++++++ xen/arch/x86/mm.c | 16 ++++++++++- 2 files changed, 59 insertions(+), 1 deletion(-) diff --git a/xen/arch/x86/include/asm/fixmap.h b/xen/arch/x86/include/asm/fixmap.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/fixmap.h +++ b/xen/arch/x86/include/asm/fixmap.h @@ -XXX,XX +XXX,XX @@ extern void __set_fixmap_x( #define __fix_x_to_virt(x) (FIXADDR_X_TOP - ((x) << PAGE_SHIFT)) #define fix_x_to_virt(x) ((void *)__fix_x_to_virt(x)) +/* per-CPU fixmap area. */ +enum percpu_fixed_addresses { + __end_of_percpu_fixed_addresses +}; + +#define PERCPU_FIXADDR_SIZE (__end_of_percpu_fixed_addresses << PAGE_SHIFT) +#define PERCPU_FIXADDR PERCPU_VIRT_SLOT(0) + +static inline void *percpu_fix_to_virt(enum percpu_fixed_addresses idx) +{ + BUG_ON(idx >=__end_of_percpu_fixed_addresses); + return (void *)PERCPU_FIXADDR + (idx << PAGE_SHIFT); +} + +static inline void percpu_set_fixmap_remote( + unsigned int cpu, enum percpu_fixed_addresses idx, mfn_t mfn, + unsigned long flags) +{ + map_pages_to_xen_cpu((unsigned long)percpu_fix_to_virt(idx), mfn, 1, flags, + cpu); +} + +static inline void percpu_clear_fixmap_remote( + unsigned int cpu, enum percpu_fixed_addresses idx) +{ + /* + * Use map_pages_to_xen_cpu() instead of destroy_xen_mappings_cpu() to + * avoid tearing down the intermediate page-tables if empty. + */ + map_pages_to_xen_cpu((unsigned long)percpu_fix_to_virt(idx), INVALID_MFN, 1, + 0, cpu); +} + +static inline void percpu_set_fixmap(enum percpu_fixed_addresses idx, mfn_t mfn, + unsigned long flags) +{ + percpu_set_fixmap_remote(smp_processor_id(), idx, mfn, flags); +} + +static inline void percpu_clear_fixmap(enum percpu_fixed_addresses idx) +{ + percpu_clear_fixmap_remote(smp_processor_id(), idx); +} + #endif /* __ASSEMBLY__ */ #endif diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ int allocate_perdomain_local_l3(unsigned int cpu) per_cpu(local_l3, cpu) = l3; - return 0; + /* + * Pre-allocate the page-table structures for the per-cpu fixmap. Some of + * the per-cpu fixmap calls might happen in contexts where memory + * allocation is not possible. + * + * Only one L3 slot is currently reserved for the per-CPU fixmap. + */ + BUILD_BUG_ON(PERCPU_FIXADDR_SIZE > (1 << L3_PAGETABLE_SHIFT)); + return map_pages_to_xen_cpu(PERCPU_VIRT_START, INVALID_MFN, + PFN_DOWN(PERCPU_FIXADDR_SIZE), MAP_SMALL_PAGES, + cpu); } void free_perdomain_local_l3(unsigned int cpu) @@ -XXX,XX +XXX,XX @@ void free_perdomain_local_l3(unsigned int cpu) return; per_cpu(local_l3, cpu) = NULL; + + destroy_xen_mappings_cpu(PERCPU_VIRT_START, + PERCPU_VIRT_START + PERCPU_FIXADDR_SIZE, cpu); + free_xenheap_page(l3); } -- 2.45.2
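For reference, a short usage sketch of the new interface. PCPU_FIX_EXAMPLE is a hypothetical entry that would be added to enum percpu_fixed_addresses; the page attribute is one already used elsewhere in the series:

    /* Map an arbitrary mfn so that only the local CPU can access it. */
    static void *example_map_local(mfn_t mfn)
    {
        percpu_set_fixmap(PCPU_FIX_EXAMPLE, mfn, __PAGE_HYPERVISOR_RW);

        /*
         * The returned linear address is the same on every CPU, but the
         * translation only exists in the local CPU's page-tables.
         */
        return percpu_fix_to_virt(PCPU_FIX_EXAMPLE);
    }

    static void example_unmap_local(void)
    {
        /* Any TLB flushing triggered by this stays local to the CPU. */
        percpu_clear_fixmap(PCPU_FIX_EXAMPLE);
    }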
When running PV guests it's possible for the guest to use the same root page table (L4) for all vCPUs, which in turn will result in Xen also using the same root page table on all pCPUs that are running any domain vCPU. When using XPTI Xen switches to a per-CPU shadow L4 when running in guest context, switching to the fully populated L4 when in Xen context. Take advantage of this existing shadowing and force the usage of a per-CPU L4 that shadows the guest-selected L4 when Address Space Isolation is requested for PV guests. The mapping of the guest L4 is done with a per-CPU fixmap entry, which however requires that the currently loaded L4 has the per-CPU slot set up. In order to ensure this, switch to the shadow per-CPU L4 with just the Xen slots populated, and only then map the guest L4 and copy the contents of the guest-controlled slots. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/domain.c | 37 +++++++++++++++++++++ xen/arch/x86/flushtlb.c | 9 ++++++ xen/arch/x86/include/asm/current.h | 15 ++++++--- xen/arch/x86/include/asm/fixmap.h | 1 + xen/arch/x86/include/asm/pv/mm.h | 8 +++++ xen/arch/x86/mm.c | 47 +++++++++++++++++++++++++++ xen/arch/x86/pv/domain.c | 25 ++++++++++++-- xen/arch/x86/pv/mm.c | 52 ++++++++++++++++++++++++++++ xen/arch/x86/smpboot.c | 20 +++++++++++- 9 files changed, 207 insertions(+), 7 deletions(-) diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -XXX,XX +XXX,XX @@ #include <asm/io.h> #include <asm/processor.h> #include <asm/desc.h> +#include <asm/fixmap.h> #include <asm/i387.h> #include <asm/xstate.h> #include <asm/cpuidle.h> @@ -XXX,XX +XXX,XX @@ void context_switch(struct vcpu *prev, struct vcpu *next) local_irq_disable(); + if ( is_pv_domain(prevd) && prevd->arch.asi ) + { + /* + * Don't leak the L4 shadow mapping in the per-CPU area. Can't be done + * in paravirt_ctxt_switch_from() because the lazy idle vCPU context + * switch would otherwise enter an infinite loop in + * mapcache_current_vcpu() with sync_local_execstate(). + * + * Note clearing the fixmap must strictly be done ahead of changing the + * current vCPU and with interrupts disabled, so there's no window + * where current->domain->arch.asi == true and PCPU_FIX_PV_L4SHADOW is + * not mapped. + */ + percpu_clear_fixmap(PCPU_FIX_PV_L4SHADOW); + get_cpu_info()->root_pgt_changed = false; + } + set_current(next); if ( (per_cpu(curr_vcpu, cpu) == next) || (is_idle_domain(nextd) && cpu_online(cpu)) ) { + if ( is_pv_domain(nextd) && nextd->arch.asi ) + { + /* Signal the fixmap entry must be mapped. */ + get_cpu_info()->new_cr3 = true; + if ( get_cpu_info()->root_pgt_changed ) + { + /* + * Map and update the shadow L4 in case we received any + * FLUSH_ROOT_PGTBL request while running on the idle vCPU. + * + * Do it before enabling interrupts so that no flush IPI can be + * delivered without having PCPU_FIX_PV_L4SHADOW correctly + * mapped. + */ + pv_update_shadow_l4(next, true); + get_cpu_info()->root_pgt_changed = false; + } + } + local_irq_enable(); } else diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/flushtlb.c +++ b/xen/arch/x86/flushtlb.c @@ -XXX,XX +XXX,XX @@ #include <asm/nops.h> #include <asm/page.h> #include <asm/pv/domain.h> +#include <asm/pv/mm.h> #include <asm/spec_ctrl.h> /* Debug builds: Wrap frequently to stress-test the wrap logic. 
*/ @@ -XXX,XX +XXX,XX @@ unsigned int flush_area_local(const void *va, unsigned int flags) unsigned int order = (flags - 1) & FLUSH_ORDER_MASK; if ( flags & FLUSH_ROOT_PGTBL ) + { + const struct vcpu *curr = current; + const struct domain *curr_d = curr->domain; + get_cpu_info()->root_pgt_changed = true; + if ( is_pv_domain(curr_d) && curr_d->arch.asi ) + /* Update the shadow root page-table ahead of doing TLB flush. */ + pv_update_shadow_l4(curr, false); + } if ( flags & (FLUSH_TLB|FLUSH_TLB_GLOBAL) ) { diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/current.h +++ b/xen/arch/x86/include/asm/current.h @@ -XXX,XX +XXX,XX @@ struct cpu_info { uint8_t scf; /* SCF_* */ /* - * The following field controls copying of the L4 page table of 64-bit - * PV guests to the per-cpu root page table on entering the guest context. - * If set the L4 page table is being copied to the root page table and - * the field will be reset. + * For XPTI the following field controls copying of the L4 page table of + * 64-bit PV guests to the per-cpu root page table on entering the guest + * context. If set the L4 page table is being copied to the root page + * table and the field will be reset. + * + * For ASI the field is used to acknowledge whether a FLUSH_ROOT_PGTBL + * request has been received when running the idle vCPU on PV guest + * page-tables (a lazy context switch to the idle vCPU). */ bool root_pgt_changed; @@ -XXX,XX +XXX,XX @@ struct cpu_info { */ bool use_pv_cr3; + /* For ASI: per-CPU fixmap of guest L4 is possibly out of sync. */ + bool new_cr3; + /* get_stack_bottom() must be 16-byte aligned */ }; diff --git a/xen/arch/x86/include/asm/fixmap.h b/xen/arch/x86/include/asm/fixmap.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/fixmap.h +++ b/xen/arch/x86/include/asm/fixmap.h @@ -XXX,XX +XXX,XX @@ extern void __set_fixmap_x( /* per-CPU fixmap area. 
*/ enum percpu_fixed_addresses { + PCPU_FIX_PV_L4SHADOW, __end_of_percpu_fixed_addresses }; diff --git a/xen/arch/x86/include/asm/pv/mm.h b/xen/arch/x86/include/asm/pv/mm.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/pv/mm.h +++ b/xen/arch/x86/include/asm/pv/mm.h @@ -XXX,XX +XXX,XX @@ bool pv_destroy_ldt(struct vcpu *v); int validate_segdesc_page(struct page_info *page); +void pv_clear_l4_guest_entries(root_pgentry_t *root_pgt); +void pv_update_shadow_l4(const struct vcpu *v, bool flush); + #else #include <xen/errno.h> @@ -XXX,XX +XXX,XX @@ static inline bool pv_map_ldt_shadow_page(unsigned int off) { return false; } static inline bool pv_destroy_ldt(struct vcpu *v) { ASSERT_UNREACHABLE(); return false; } +static inline void pv_clear_l4_guest_entries(root_pgentry_t *root_pgt) +{ ASSERT_UNREACHABLE(); } +static inline void pv_update_shadow_l4(const struct vcpu *v, bool flush) +{ ASSERT_UNREACHABLE(); } + #endif #endif /* __X86_PV_MM_H__ */ diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ void make_cr3(struct vcpu *v, mfn_t mfn) v->arch.cr3 = mfn_x(mfn) << PAGE_SHIFT; if ( is_pv_domain(d) && d->arch.pv.pcid ) v->arch.cr3 |= get_pcid_bits(v, false); + if ( is_pv_domain(d) && d->arch.asi ) + get_cpu_info()->new_cr3 = true; } void write_ptbase(struct vcpu *v) @@ -XXX,XX +XXX,XX @@ void write_ptbase(struct vcpu *v) cpu_info->pv_cr3 |= get_pcid_bits(v, true); switch_cr3_cr4(v->arch.cr3, new_cr4); } + else if ( is_pv_domain(d) && d->arch.asi ) + { + root_pgentry_t *root_pgt = this_cpu(root_pgt); + unsigned long cr3 = __pa(root_pgt); + + /* + * XPTI and ASI cannot be simultaneously used even by different + * domains at runtime. + */ + ASSERT(!cpu_info->use_pv_cr3 && !cpu_info->xen_cr3 && + !cpu_info->pv_cr3); + + if ( new_cr4 & X86_CR4_PCIDE ) + cr3 |= get_pcid_bits(v, false); + + /* + * Zap guest L4 entries ahead of flushing the TLB, so that the CPU + * cannot speculatively populate the TLB with stale mappings. + */ + pv_clear_l4_guest_entries(root_pgt); + + /* + * Switch to the shadow L4 with just the Xen slots populated, the guest + * slots will be populated by pv_update_shadow_l4() once running on the + * shadow L4. + * + * The reason for switching to the per-CPU shadow L4 before updating + * the guest slots is that pv_update_shadow_l4() uses per-CPU mappings, + * and the in-use page-table previous to the switch_cr3_cr4() call + * might not support per-CPU mappings. + */ + switch_cr3_cr4(cr3, new_cr4); + pv_update_shadow_l4(v, false); + } else { ASSERT(!is_hvm_domain(d) || !d->arch.asi @@ -XXX,XX +XXX,XX @@ void setup_perdomain_slot(const struct vcpu *v, root_pgentry_t *root_pgt) ASSERT(l3); populate_perdomain(d, root_pgt, l3); + + if ( is_pv_domain(d) ) + { + /* + * Abuse the fact that this function is called on vCPU context + * switch and clean previous guest controlled slots from the shadow + * L4. 
+ */ + pv_clear_l4_guest_entries(root_pgt); + get_cpu_info()->new_cr3 = true; + } } else if ( is_hvm_domain(d) || d->arch.pv.xpti ) l4e_write(&root_pgt[root_table_offset(PERDOMAIN_VIRT_START)], diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/domain.c +++ b/xen/arch/x86/pv/domain.c @@ -XXX,XX +XXX,XX @@ #include <asm/invpcid.h> #include <asm/spec_ctrl.h> #include <asm/pv/domain.h> +#include <asm/pv/mm.h> #include <asm/shadow.h> #ifdef CONFIG_PV32 @@ -XXX,XX +XXX,XX @@ int pv_domain_initialise(struct domain *d) d->arch.ctxt_switch = &pv_csw; - d->arch.pv.flush_root_pt = d->arch.pv.xpti; + d->arch.pv.flush_root_pt = d->arch.pv.xpti || d->arch.asi; if ( !is_pv_32bit_domain(d) && use_invpcid && cpu_has_pcid ) switch ( ACCESS_ONCE(opt_pcid) ) @@ -XXX,XX +XXX,XX @@ static void _toggle_guest_pt(struct vcpu *v) * to release). Switch to the idle page tables in such an event; the * guest will have been crashed already. */ - cr3 = v->arch.cr3; + if ( v->domain->arch.asi ) + { + /* + * _toggle_guest_pt() might switch between user and kernel page tables, + * but doesn't use write_ptbase(), and hence needs an explicit call to + * sync the shadow L4. + */ + cr3 = __pa(this_cpu(root_pgt)); + if ( v->domain->arch.pv.pcid ) + cr3 |= get_pcid_bits(v, false); + /* + * Ensure the current root page table is already the shadow L4, as + * guest user/kernel switches can only happen once the guest is + * running. + */ + ASSERT(read_cr3() == cr3); + pv_update_shadow_l4(v, false); + } + else + cr3 = v->arch.cr3; + if ( shadow_mode_enabled(v->domain) ) { cr3 &= ~X86_CR3_NOFLUSH; diff --git a/xen/arch/x86/pv/mm.c b/xen/arch/x86/pv/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/mm.c +++ b/xen/arch/x86/pv/mm.c @@ -XXX,XX +XXX,XX @@ #include <xen/guest_access.h> #include <asm/current.h> +#include <asm/fixmap.h> #include <asm/p2m.h> #include "mm.h" @@ -XXX,XX +XXX,XX @@ void init_xen_pae_l2_slots(l2_pgentry_t *l2t, const struct domain *d) } #endif +void pv_clear_l4_guest_entries(root_pgentry_t *root_pgt) +{ + unsigned int i; + + for ( i = 0; i < ROOT_PAGETABLE_FIRST_XEN_SLOT; i++ ) + l4e_write(&root_pgt[i], l4e_empty()); + for ( i = ROOT_PAGETABLE_LAST_XEN_SLOT + 1; i < L4_PAGETABLE_ENTRIES; i++ ) + l4e_write(&root_pgt[i], l4e_empty()); +} + +void pv_update_shadow_l4(const struct vcpu *v, bool flush) +{ + const root_pgentry_t *guest_pgt = percpu_fix_to_virt(PCPU_FIX_PV_L4SHADOW); + root_pgentry_t *shadow_pgt = this_cpu(root_pgt); + + ASSERT(!v->domain->arch.pv.xpti); + ASSERT(is_pv_vcpu(v)); + ASSERT(!is_idle_vcpu(v)); + + if ( get_cpu_info()->new_cr3 ) + { + percpu_set_fixmap(PCPU_FIX_PV_L4SHADOW, maddr_to_mfn(v->arch.cr3), + __PAGE_HYPERVISOR_RO); + get_cpu_info()->new_cr3 = false; + } + + if ( is_pv_32bit_vcpu(v) ) + { + l4e_write(&shadow_pgt[0], guest_pgt[0]); + l4e_write(&shadow_pgt[root_table_offset(PERDOMAIN_ALT_VIRT_START)], + shadow_pgt[root_table_offset(PERDOMAIN_VIRT_START)]); + } + else + { + unsigned int i; + + for ( i = 0; i < ROOT_PAGETABLE_FIRST_XEN_SLOT; i++ ) + l4e_write(&shadow_pgt[i], guest_pgt[i]); + for ( i = ROOT_PAGETABLE_LAST_XEN_SLOT + 1; + i < L4_PAGETABLE_ENTRIES; i++ ) + l4e_write(&shadow_pgt[i], guest_pgt[i]); + + /* The presence of this Xen slot is selected by the guest. 
*/ + l4e_write(&shadow_pgt[l4_table_offset(RO_MPT_VIRT_START)], + guest_pgt[l4_table_offset(RO_MPT_VIRT_START)]); + } + + if ( flush ) + flush_local(FLUSH_TLB_GLOBAL); +} + /* * Local variables: * mode: C diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/smpboot.c +++ b/xen/arch/x86/smpboot.c @@ -XXX,XX +XXX,XX @@ int setup_cpu_root_pgt(unsigned int cpu) unsigned int off; int rc; - if ( !opt_xpti_hwdom && !opt_xpti_domu ) + if ( !opt_xpti_hwdom && !opt_xpti_domu && !opt_asi_pv ) return 0; rpt = alloc_xenheap_page(); @@ -XXX,XX +XXX,XX @@ int setup_cpu_root_pgt(unsigned int cpu) clear_page(rpt); per_cpu(root_pgt, cpu) = rpt; + if ( opt_asi_pv ) + { + /* + * Populate the Xen slots, the guest ones will be copied from the guest + * root page-table. + */ + init_xen_l4_slots(rpt, _mfn(virt_to_mfn(rpt)), INVALID_MFN, NULL, + false, false, true); + + return 0; + } + rpt[root_table_offset(RO_MPT_VIRT_START)] = idle_pg_table[root_table_offset(RO_MPT_VIRT_START)]; /* SH_LINEAR_PT inserted together with guest mappings. */ @@ -XXX,XX +XXX,XX @@ static void cleanup_cpu_root_pgt(unsigned int cpu) per_cpu(root_pgt, cpu) = NULL; + if ( opt_asi_pv ) + { + free_xenheap_page(rpt); + return; + } + for ( r = root_table_offset(DIRECTMAP_VIRT_START); r < root_table_offset(HYPERVISOR_VIRT_END); ++r ) { -- 2.45.2
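To summarize the ordering constraints described in the commit message, here is a condensed, non-literal sketch of loading a PV guest root page-table with ASI enabled, based on the write_ptbase() hunk above (PCID handling and assertions omitted; the function name is made up):

    static void example_load_pv_guest_root(struct vcpu *v, unsigned long cr4)
    {
        root_pgentry_t *shadow = this_cpu(root_pgt);

        /* 1. Zap stale guest slots so they cannot be speculatively cached. */
        pv_clear_l4_guest_entries(shadow);

        /* 2. Run on the shadow L4: Xen slots only, per-CPU slot populated. */
        switch_cr3_cr4(__pa(shadow), cr4);

        /*
         * 3. Per-CPU mappings now work, so map the guest L4 through
         *    PCPU_FIX_PV_L4SHADOW and copy the guest-controlled slots.
         */
        pv_update_shadow_l4(v, false);
    }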
When using ASI the CPU stack is mapped using a range of fixmap entries in the per-CPU region. This ensures the stack is only accessible by the current CPU. Note however there's further work required in order to allocate the stack from domheap instead of xenheap, and ensure the stack is not part of the direct map. For domains not running with ASI enabled all the CPU stacks are mapped in the per-domain L3, so that the stack is always at the same linear address, regardless of whether ASI is enabled or not for the domain. When calling UEFI runtime methods the current per-domain slot needs to be added to the EFI L4, so that the stack is available in UEFI. Finally, some users of callfunc IPIs pass parameters from the stack, so when handling a callfunc IPI the stack of the caller CPU is mapped into the address space of the CPU handling the IPI. This needs further work to use a bounce buffer in order to avoid having to map remote CPU stacks. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- There's also further work required in order to avoid mapping remote stack when handling callfunc IPIs. --- xen/arch/x86/domain.c | 12 +++ xen/arch/x86/include/asm/current.h | 5 ++ xen/arch/x86/include/asm/fixmap.h | 5 ++ xen/arch/x86/include/asm/mm.h | 6 +- xen/arch/x86/include/asm/smp.h | 12 +++ xen/arch/x86/mm.c | 125 +++++++++++++++++++++++++++-- xen/arch/x86/setup.c | 27 +++++-- xen/arch/x86/smp.c | 29 +++++++ xen/arch/x86/smpboot.c | 47 ++++++++++- xen/arch/x86/traps.c | 6 +- xen/common/efi/runtime.c | 12 +++ xen/common/smp.c | 10 +++ xen/include/xen/smp.h | 5 ++ 13 files changed, 281 insertions(+), 20 deletions(-) diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -XXX,XX +XXX,XX @@ int arch_domain_create(struct domain *d, d->arch.msr_relaxed = config->arch.misc_flags & XEN_X86_MSR_RELAXED; + if ( !d->arch.asi && (opt_asi_hvm || opt_asi_pv ) ) + { + /* + * This domain is not using ASI, but other domains on the system + * possibly are, hence the CPU stacks are on the per-CPU page-table + * region. Add an L3 entry that has all the stacks mapped. + */ + rc = map_all_stacks(d); + if ( rc ) + goto fail; + } + return 0; fail: diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/current.h +++ b/xen/arch/x86/include/asm/current.h @@ -XXX,XX +XXX,XX @@ * 0 - IST Shadow Stacks (4x 1k, read-only) */ +static inline bool is_shstk_slot(unsigned int i) +{ + return (i == 0 || i == PRIMARY_SHSTK_SLOT); +} + /* * Identify which stack page the stack pointer is on. Returns an index * as per the comment above. diff --git a/xen/arch/x86/include/asm/fixmap.h b/xen/arch/x86/include/asm/fixmap.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/fixmap.h +++ b/xen/arch/x86/include/asm/fixmap.h @@ -XXX,XX +XXX,XX @@ extern void __set_fixmap_x( /* per-CPU fixmap area. */ enum percpu_fixed_addresses { + /* For alignment reasons the per-CPU stacks must come first. 
*/ + PCPU_STACK_START, + PCPU_STACK_END = PCPU_STACK_START + NR_CPUS * (1U << STACK_ORDER) - 1, +#define PERCPU_STACK_IDX(c) (PCPU_STACK_START + (c) * (1U << STACK_ORDER)) +#define PERCPU_STACK_ADDR(c) percpu_fix_to_virt(PERCPU_STACK_IDX(c)) PCPU_FIX_PV_L4SHADOW, __end_of_percpu_fixed_addresses }; diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/mm.h +++ b/xen/arch/x86/include/asm/mm.h @@ -XXX,XX +XXX,XX @@ extern struct rangeset *mmio_ro_ranges; #define compat_pfn_to_cr3(pfn) (((unsigned)(pfn) << 12) | ((unsigned)(pfn) >> 20)) #define compat_cr3_to_pfn(cr3) (((unsigned)(cr3) >> 12) | ((unsigned)(cr3) << 20)) -void memguard_guard_stack(void *p); +void memguard_guard_stack(void *p, unsigned int cpu); void memguard_unguard_stack(void *p); struct mmio_ro_emulate_ctxt { @@ -XXX,XX +XXX,XX @@ static inline int destroy_xen_mappings_cpu(unsigned long s, unsigned long e, return modify_xen_mappings_cpu(s, e, _PAGE_NONE, cpu); } +/* Setup a per-domain slot that maps all pCPU stacks. */ +int map_all_stacks(struct domain *d); +int add_stack(const void *stack, unsigned int cpu); + #endif /* __ASM_X86_MM_H__ */ diff --git a/xen/arch/x86/include/asm/smp.h b/xen/arch/x86/include/asm/smp.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/smp.h +++ b/xen/arch/x86/include/asm/smp.h @@ -XXX,XX +XXX,XX @@ extern bool unaccounted_cpus; void *cpu_alloc_stack(unsigned int cpu); +/* + * Setup the per-CPU area stack mappings. + * + * @dest_cpu: CPU where the mappings are to appear. + * @stack_cpu: CPU whose stacks should be mapped. + */ +void cpu_set_stack_mappings(unsigned int dest_cpu, unsigned int stack_cpu); + +#define HAS_ARCH_SMP_CALLFUNC +void arch_smp_pre_callfunc(unsigned int cpu); +void arch_smp_post_callfunc(unsigned int cpu); + #endif /* !__ASSEMBLY__ */ #endif diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ * doing the final put_page(), and remove it from the iommu if so. */ +#include <xen/cpu.h> #include <xen/init.h> #include <xen/ioreq.h> #include <xen/kernel.h> @@ -XXX,XX +XXX,XX @@ void free_perdomain_mappings(struct domain *d) d->arch.perdomain_l3_pg = NULL; } -static void write_sss_token(unsigned long *ptr) +static void write_sss_token(unsigned long *ptr, unsigned long va) { /* * A supervisor shadow stack token is its own linear address, with the * busy bit (0) clear. */ - *ptr = (unsigned long)ptr; + *ptr = va; } -void memguard_guard_stack(void *p) +void memguard_guard_stack(void *p, unsigned int cpu) { + unsigned long va = + (opt_asi_hvm || opt_asi_pv) ? (unsigned long)PERCPU_STACK_ADDR(cpu) + : (unsigned long)p; + /* IST Shadow stacks. 4x 1k in stack page 0. */ if ( IS_ENABLED(CONFIG_XEN_SHSTK) ) { - write_sss_token(p + (IST_MCE * IST_SHSTK_SIZE) - 8); - write_sss_token(p + (IST_NMI * IST_SHSTK_SIZE) - 8); - write_sss_token(p + (IST_DB * IST_SHSTK_SIZE) - 8); - write_sss_token(p + (IST_DF * IST_SHSTK_SIZE) - 8); + write_sss_token(p + (IST_MCE * IST_SHSTK_SIZE) - 8, + va + (IST_MCE * IST_SHSTK_SIZE) - 8); + write_sss_token(p + (IST_NMI * IST_SHSTK_SIZE) - 8, + va + (IST_NMI * IST_SHSTK_SIZE) - 8); + write_sss_token(p + (IST_DB * IST_SHSTK_SIZE) - 8, + va + (IST_DB * IST_SHSTK_SIZE) - 8); + write_sss_token(p + (IST_DF * IST_SHSTK_SIZE) - 8, + va + (IST_DF * IST_SHSTK_SIZE) - 8); } map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, PAGE_HYPERVISOR_SHSTK); /* Primary Shadow Stack. 
1x 4k in stack page 5. */ p += PRIMARY_SHSTK_SLOT * PAGE_SIZE; + va += PRIMARY_SHSTK_SLOT * PAGE_SIZE; if ( IS_ENABLED(CONFIG_XEN_SHSTK) ) - write_sss_token(p + PAGE_SIZE - 8); + write_sss_token(p + PAGE_SIZE - 8, va + PAGE_SIZE - 8); map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, PAGE_HYPERVISOR_SHSTK); } @@ -XXX,XX +XXX,XX @@ void setup_perdomain_slot(const struct vcpu *v, root_pgentry_t *root_pgt) root_pgt[root_table_offset(PERDOMAIN_VIRT_START)]); } +static struct page_info *l2_all_stacks; + +int add_stack(const void *stack, unsigned int cpu) +{ + unsigned long va = (unsigned long)PERCPU_STACK_ADDR(cpu); + struct page_info *pg; + l2_pgentry_t *l2tab = NULL; + l1_pgentry_t *l1tab = NULL; + unsigned int nr; + int rc = 0; + + /* + * Assume CPU stack allocation is always serialized, either because it's + * done on the BSP during boot, or in case of hotplug, in stop machine + * context. + */ + ASSERT(system_state < SYS_STATE_active || cpu_in_hotplug_context()); + + if ( !opt_asi_hvm && !opt_asi_pv ) + return 0; + + if ( !l2_all_stacks ) + { + l2_all_stacks = alloc_domheap_page(NULL, MEMF_no_owner); + if ( !l2_all_stacks ) + return -ENOMEM; + l2tab = __map_domain_page(l2_all_stacks); + clear_page(l2tab); + } + else + l2tab = __map_domain_page(l2_all_stacks); + + /* code assumes all the stacks can be mapped with a single l2. */ + ASSERT(l3_table_offset((unsigned long)percpu_fix_to_virt(PCPU_STACK_END)) == + l3_table_offset((unsigned long)percpu_fix_to_virt(PCPU_STACK_START))); + for ( nr = 0 ; nr < (1U << STACK_ORDER) ; nr++) + { + l2_pgentry_t *pl2e = l2tab + l2_table_offset(va); + + if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) ) + { + pg = alloc_domheap_page(NULL, MEMF_no_owner); + if ( !pg ) + { + rc = -ENOMEM; + break; + } + l1tab = __map_domain_page(pg); + clear_page(l1tab); + l2e_write(pl2e, l2e_from_page(pg, __PAGE_HYPERVISOR_RW)); + } + else if ( !l1tab ) + l1tab = map_l1t_from_l2e(*pl2e); + + l1e_write(&l1tab[l1_table_offset(va)], + l1e_from_mfn(virt_to_mfn(stack), + is_shstk_slot(nr) ? __PAGE_HYPERVISOR_SHSTK + : __PAGE_HYPERVISOR_RW)); + + va += PAGE_SIZE; + stack += PAGE_SIZE; + + if ( !l1_table_offset(va) ) + { + unmap_domain_page(l1tab); + l1tab = NULL; + } + } + + unmap_domain_page(l1tab); + unmap_domain_page(l2tab); + /* + * Don't care to free the intermediate page-tables on failure, can be used + * to map other stacks. + */ + + return rc; +} + +int map_all_stacks(struct domain *d) +{ + /* + * Create the per-domain L3. Pass a dummy PERDOMAIN_VIRT_START, but note + * only the per-domain L3 is allocated when nr == 0. 
+ */ + int rc = create_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0, NULL, NULL); + l3_pgentry_t *l3tab; + + if ( rc ) + return rc; + + l3tab = __map_domain_page(d->arch.perdomain_l3_pg); + l3tab[l3_table_offset((unsigned long)percpu_fix_to_virt(PCPU_STACK_START))] + = l3e_from_page(l2_all_stacks, __PAGE_HYPERVISOR_RW); + unmap_domain_page(l3tab); + + return 0; +} + static void __init __maybe_unused build_assertions(void) { /* diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/setup.c +++ b/xen/arch/x86/setup.c @@ -XXX,XX +XXX,XX @@ static void __init noreturn reinit_bsp_stack(void) /* Update SYSCALL trampolines */ percpu_traps_init(); - stack_base[0] = stack; - rc = setup_cpu_root_pgt(0); if ( rc ) panic("Error %d setting up PV root page table\n", rc); @@ -XXX,XX +XXX,XX @@ void asmlinkage __init noreturn __start_xen(unsigned long mbi_p) system_state = SYS_STATE_boot; - bsp_stack = cpu_alloc_stack(0); - if ( !bsp_stack ) - panic("No memory for BSP stack\n"); - console_init_ring(); vesa_init(); @@ -XXX,XX +XXX,XX @@ void asmlinkage __init noreturn __start_xen(unsigned long mbi_p) alternative_branches(); + /* + * Alloc the BSP stack closer to the point where the AP ones also get + * allocated - and after the speculation mitigations have been initialized. + * In order to set up the shadow stack token correctly Xen needs to know + * whether per-CPU mapped stacks are being used. + */ + bsp_stack = cpu_alloc_stack(0); + if ( !bsp_stack ) + panic("No memory for BSP stack\n"); + /* * Setup the local per-domain L3 for the BSP also, so it matches the state * of the APs. @@ -XXX,XX +XXX,XX @@ void asmlinkage __init noreturn __start_xen(unsigned long mbi_p) info->last_spec_ctrl = default_xen_spec_ctrl; } + stack_base[0] = bsp_stack; + /* Copy the cpu info block, and move onto the BSP stack. */ - bsp_info = get_cpu_info_from_stack((unsigned long)bsp_stack); + if ( opt_asi_hvm || opt_asi_pv ) + { + cpu_set_stack_mappings(0, 0); + bsp_info = get_cpu_info_from_stack((unsigned long)PERCPU_STACK_ADDR(0)); + } + else + bsp_info = get_cpu_info_from_stack((unsigned long)bsp_stack); + *bsp_info = *info; asm volatile ("mov %[stk], %%rsp; jmp %c[fn]" :: diff --git a/xen/arch/x86/smp.c b/xen/arch/x86/smp.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/smp.c +++ b/xen/arch/x86/smp.c @@ -XXX,XX +XXX,XX @@ #include <asm/hardirq.h> #include <asm/hpet.h> #include <asm/setup.h> +#include <asm/spec_ctrl.h> #include <irq_vectors.h> #include <mach_apic.h> @@ -XXX,XX +XXX,XX @@ long cf_check cpu_down_helper(void *data) ret = cpu_down(cpu); return ret; } + +void arch_smp_pre_callfunc(unsigned int cpu) +{ + if ( (!opt_asi_pv && !opt_asi_hvm) || cpu == smp_processor_id() || + (!current->domain->arch.asi && !is_idle_vcpu(current)) || + /* + * CPU#0 still runs on the .init stack when the APs are started, don't + * attempt to map such stack. 
+ */ + (!cpu && system_state < SYS_STATE_active) ) + return; + + cpu_set_stack_mappings(smp_processor_id(), cpu); +} + +void arch_smp_post_callfunc(unsigned int cpu) +{ + unsigned int i; + + if ( (!opt_asi_pv && !opt_asi_hvm) || cpu == smp_processor_id() || + (!current->domain->arch.asi && !is_idle_vcpu(current)) ) + return; + + for ( i = 0; i < (1U << STACK_ORDER); i++ ) + percpu_clear_fixmap(PERCPU_STACK_IDX(cpu) + i); + + flush_area_local(PERCPU_STACK_ADDR(cpu), FLUSH_ORDER(STACK_ORDER)); +} diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/smpboot.c +++ b/xen/arch/x86/smpboot.c @@ -XXX,XX +XXX,XX @@ static int do_boot_cpu(int apicid, int cpu) printk("Booting processor %d/%d eip %lx\n", cpu, apicid, start_eip); - stack_start = stack_base[cpu] + STACK_SIZE - sizeof(struct cpu_info); + if ( opt_asi_hvm || opt_asi_pv ) + { + /* + * Uniformly run with the stack mapping of the per-CPU area (including + * the idle vCPU) if ASI is enabled for any domain type. + */ + cpu_set_stack_mappings(cpu, cpu); + + ASSERT(IS_ALIGNED((unsigned long)PERCPU_STACK_ADDR(cpu), STACK_SIZE)); + + stack_start = PERCPU_STACK_ADDR(cpu) + STACK_SIZE - sizeof(struct cpu_info); + } + else + stack_start = stack_base[cpu] + STACK_SIZE - sizeof(struct cpu_info); /* * If per-CPU idle root page table has been allocated, switch to it as @@ -XXX,XX +XXX,XX @@ void *cpu_alloc_stack(unsigned int cpu) stack = alloc_xenheap_pages(STACK_ORDER, memflags); if ( stack ) - memguard_guard_stack(stack); + { + int rc = add_stack(stack, cpu); + + if ( rc ) + { + printk(XENLOG_ERR "unable to map stack for CPU %u: %d\n", cpu, rc); + free_xenheap_pages(stack, STACK_ORDER); + return NULL; + } + memguard_guard_stack(stack, cpu); + } return stack; } +void cpu_set_stack_mappings(unsigned int dest_cpu, unsigned int stack_cpu) +{ + unsigned int i; + + for ( i = 0; i < (1U << STACK_ORDER); i++ ) + { + unsigned int flags = (is_shstk_slot(i) ? __PAGE_HYPERVISOR_SHSTK + : __PAGE_HYPERVISOR_RW) | + (dest_cpu == stack_cpu ? _PAGE_GLOBAL : 0); + + if ( is_shstk_slot(i) && dest_cpu != stack_cpu ) + continue; + + percpu_set_fixmap_remote(dest_cpu, PERCPU_STACK_IDX(stack_cpu) + i, + _mfn(virt_to_mfn(stack_base[stack_cpu] + + i * PAGE_SIZE)), + flags); + } +} + static int cpu_smpboot_alloc(unsigned int cpu) { struct cpu_info *info; diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/traps.c +++ b/xen/arch/x86/traps.c @@ -XXX,XX +XXX,XX @@ void show_stack_overflow(unsigned int cpu, const struct cpu_user_regs *regs) unsigned long esp = regs->rsp; unsigned long curr_stack_base = esp & ~(STACK_SIZE - 1); unsigned long esp_top, esp_bottom; + const void *stack = current->domain->arch.asi ? 
PERCPU_STACK_ADDR(cpu) + : stack_base[cpu]; - if ( _p(curr_stack_base) != stack_base[cpu] ) + if ( _p(curr_stack_base) != stack ) printk("Current stack base %p differs from expected %p\n", - _p(curr_stack_base), stack_base[cpu]); + _p(curr_stack_base), stack); esp_bottom = (esp | (STACK_SIZE - 1)) + 1; esp_top = esp_bottom - PRIMARY_STACK_SIZE; diff --git a/xen/common/efi/runtime.c b/xen/common/efi/runtime.c index XXXXXXX..XXXXXXX 100644 --- a/xen/common/efi/runtime.c +++ b/xen/common/efi/runtime.c @@ -XXX,XX +XXX,XX @@ void efi_rs_leave(struct efi_rs_state *state); #ifndef CONFIG_ARM # include <asm/i387.h> +# include <asm/spec_ctrl.h> # include <asm/xstate.h> # include <public/platform.h> #endif @@ -XXX,XX +XXX,XX @@ struct efi_rs_state efi_rs_enter(void) static const u16 fcw = FCW_DEFAULT; static const u32 mxcsr = MXCSR_DEFAULT; struct efi_rs_state state = { .cr3 = 0 }; + root_pgentry_t *efi_pgt, *idle_pgt; if ( mfn_eq(efi_l4_mfn, INVALID_MFN) ) return state; @@ -XXX,XX +XXX,XX @@ struct efi_rs_state efi_rs_enter(void) efi_rs_on_cpu = smp_processor_id(); + if ( opt_asi_pv || opt_asi_hvm ) + { + /* Insert the idle per-domain slot for the stack mapping. */ + efi_pgt = map_domain_page(efi_l4_mfn); + idle_pgt = maddr_to_virt(idle_vcpu[efi_rs_on_cpu]->arch.cr3); + efi_pgt[root_table_offset(PERDOMAIN_VIRT_START)].l4 = + idle_pgt[root_table_offset(PERDOMAIN_VIRT_START)].l4; + unmap_domain_page(efi_pgt); + } + /* prevent fixup_page_fault() from doing anything */ irq_enter(); diff --git a/xen/common/smp.c b/xen/common/smp.c index XXXXXXX..XXXXXXX 100644 --- a/xen/common/smp.c +++ b/xen/common/smp.c @@ -XXX,XX +XXX,XX @@ static struct call_data_struct { void (*func) (void *info); void *info; int wait; + unsigned int caller; cpumask_t selected; } call_data; @@ -XXX,XX +XXX,XX @@ void on_selected_cpus( call_data.func = func; call_data.info = info; call_data.wait = wait; + call_data.caller = smp_processor_id(); smp_send_call_function_mask(&call_data.selected); @@ -XXX,XX +XXX,XX @@ void smp_call_function_interrupt(void) if ( !cpumask_test_cpu(cpu, &call_data.selected) ) return; + /* + * TODO: use bounce buffers to pass callfunc data, so that when using ASI + * there's no need to map remote CPU stacks. + */ + arch_smp_pre_callfunc(call_data.caller); + irq_enter(); if ( unlikely(!func) ) @@ -XXX,XX +XXX,XX @@ void smp_call_function_interrupt(void) } irq_exit(); + + arch_smp_post_callfunc(call_data.caller); } /* diff --git a/xen/include/xen/smp.h b/xen/include/xen/smp.h index XXXXXXX..XXXXXXX 100644 --- a/xen/include/xen/smp.h +++ b/xen/include/xen/smp.h @@ -XXX,XX +XXX,XX @@ extern void *stack_base[NR_CPUS]; void initialize_cpu_data(unsigned int cpu); int setup_cpu_root_pgt(unsigned int cpu); +#ifndef HAS_ARCH_SMP_CALLFUNC +static inline void arch_smp_pre_callfunc(unsigned int cpu) {} +static inline void arch_smp_post_callfunc(unsigned int cpu) {} +#endif + #endif /* __XEN_SMP_H__ */ -- 2.45.2
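As an aside, a minimal standalone sketch of the fixmap arithmetic behind PERCPU_STACK_IDX()/PERCPU_STACK_ADDR() above: with the stack slots placed first in the per-CPU fixmap enum (index 0) and the fixmap base suitably aligned, each CPU's run of (1 << STACK_ORDER) slots yields a STACK_SIZE-aligned linear address, which is what the ASSERT in do_boot_cpu() relies on. The base address, PAGE_SIZE and STACK_ORDER values below are illustrative assumptions, not values taken from Xen.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT       12
#define PAGE_SIZE        (1UL << PAGE_SHIFT)
#define STACK_ORDER      3                          /* assumed: 8 pages per stack */
#define STACK_SIZE       (PAGE_SIZE << STACK_ORDER)
#define PCPU_FIXMAP_BASE 0xffff82d0c0000000UL       /* hypothetical per-CPU fixmap base */

/* Mirrors PERCPU_STACK_IDX(): each CPU owns a contiguous run of slots. */
static unsigned int percpu_stack_idx(unsigned int cpu)
{
    return cpu * (1U << STACK_ORDER);
}

/* Mirrors PERCPU_STACK_ADDR(): fixmap slot index to linear address. */
static uintptr_t percpu_stack_addr(unsigned int cpu)
{
    return PCPU_FIXMAP_BASE + (uintptr_t)percpu_stack_idx(cpu) * PAGE_SIZE;
}

int main(void)
{
    for ( unsigned int cpu = 0; cpu < 4; cpu++ )
    {
        uintptr_t va = percpu_stack_addr(cpu);

        /* The mapping must stay STACK_SIZE-aligned for the stack layout. */
        assert(!(va & (STACK_SIZE - 1)));
        printf("CPU%u stack at %#lx\n", cpu, (unsigned long)va);
    }

    return 0;
}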
With the stack mapped on a per-CPU basis there's no risk of other CPUs being able to read the stack contents, but vCPUs running on the current pCPU could read stack rubble from operations of previous vCPUs. The #DF stack is not zeroed because handling of #DF results in a panic. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/include/asm/current.h | 30 +++++++++++++++++++++++++++++- 1 file changed, 29 insertions(+), 1 deletion(-) diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/current.h +++ b/xen/arch/x86/include/asm/current.h @@ -XXX,XX +XXX,XX @@ unsigned long get_stack_dump_bottom (unsigned long sp); # define SHADOW_STACK_WORK "" #endif +#define ZERO_STACK \ + "test %[stk_size], %[stk_size];" \ + "jz .L_skip_zeroing.%=;" \ + "std;" \ + "rep stosb;" \ + "cld;" \ + ".L_skip_zeroing.%=:" + #if __GNUC__ >= 9 # define ssaj_has_attr_noreturn(fn) __builtin_has_attribute(fn, __noreturn__) #else @@ -XXX,XX +XXX,XX @@ unsigned long get_stack_dump_bottom (unsigned long sp); #define switch_stack_and_jump(fn, instr, constr) \ ({ \ unsigned int tmp; \ + bool zero_stack = current->domain->arch.asi; \ BUILD_BUG_ON(!ssaj_has_attr_noreturn(fn)); \ + ASSERT(IS_ALIGNED((unsigned long)guest_cpu_user_regs() - \ + PRIMARY_STACK_SIZE + \ + sizeof(struct cpu_info), PAGE_SIZE)); \ + if ( zero_stack ) \ + { \ + unsigned long stack_top = get_stack_bottom() & \ + ~(STACK_SIZE - 1); \ + \ + clear_page((void *)stack_top + IST_MCE * PAGE_SIZE); \ + clear_page((void *)stack_top + IST_NMI * PAGE_SIZE); \ + clear_page((void *)stack_top + IST_DB * PAGE_SIZE); \ + } \ __asm__ __volatile__ ( \ SHADOW_STACK_WORK \ "mov %[stk], %%rsp;" \ + ZERO_STACK \ CHECK_FOR_LIVEPATCH_WORK \ instr "[fun]" \ : [val] "=&r" (tmp), \ @@ -XXX,XX +XXX,XX @@ unsigned long get_stack_dump_bottom (unsigned long sp); ((PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8), \ [stack_mask] "i" (STACK_SIZE - 1), \ _ASM_BUGFRAME_INFO(BUGFRAME_bug, __LINE__, \ - __FILE__, NULL) \ + __FILE__, NULL), \ + /* For stack zeroing. */ \ + "D" ((void *)guest_cpu_user_regs() - 1), \ + [stk_size] "c" \ + (zero_stack ? PRIMARY_STACK_SIZE - sizeof(struct cpu_info)\ + : 0), \ + "a" (0) \ : "memory" ); \ unreachable(); \ }) -- 2.45.2
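For readers not fluent in the inline assembly above, here is a plain-C model of what the ZERO_STACK sequence ("std; rep stosb; cld", with RDI pointing just below the saved cpu_info and RCX holding PRIMARY_STACK_SIZE - sizeof(struct cpu_info)) plus the three clear_page() calls achieve: the IST stack pages and the primary stack are wiped, while the cpu_info block at the very top of the stack allocation is preserved. Page counts, IST page indices and the cpu_info size are assumptions made to keep the example self-contained, not values copied from Xen headers.

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE           4096UL
#define STACK_PAGES         8UL                     /* assumed stack order of 3 */
#define STACK_SIZE          (STACK_PAGES * PAGE_SIZE)
#define PRIMARY_STACK_SIZE  (3UL * PAGE_SIZE)       /* assumed primary stack size */

struct cpu_info { char opaque[256]; };              /* stand-in for the real struct */

/* Hypothetical IST page indices within the stack allocation. */
enum { IST_MCE = 1, IST_NMI = 2, IST_DB = 3 };

static void zero_stack_model(void *stack_base)
{
    uint8_t *base = stack_base;                     /* lowest address of the stack area */
    uint8_t *cpu_info_ptr = base + STACK_SIZE - sizeof(struct cpu_info);
    size_t primary_len = PRIMARY_STACK_SIZE - sizeof(struct cpu_info);

    /* IST stacks are wiped page by page; the #DF one is skipped as it panics. */
    memset(base + IST_MCE * PAGE_SIZE, 0, PAGE_SIZE);
    memset(base + IST_NMI * PAGE_SIZE, 0, PAGE_SIZE);
    memset(base + IST_DB  * PAGE_SIZE, 0, PAGE_SIZE);

    /*
     * The primary stack is wiped downwards from just below cpu_info, which is
     * the effect of "std; rep stosb; cld" with RDI/RCX set up as in the patch.
     */
    memset(cpu_info_ptr - primary_len, 0, primary_len);
}

int main(void)
{
    static uint8_t stack[STACK_SIZE];

    zero_stack_model(stack);
    return 0;
}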
Hello, The aim of this series is to introduce the functionality required to create linear mappings visible to a single pCPU. Doing so requires having a per-vCPU root page-table (L4), and hence requires shadowing the guest-selected L4 on PV guests. As follow-ups (and partially to ensure the per-CPU mappings work fine) the CPU stacks are switched to use per-CPU mappings, so that remote stack contents are not by default mapped on all page-tables (note: for this to be true the directmap entries for the stack pages would need to be removed also). There's one known shortcoming with the presented code: migration of PV guests using per-vCPU root page-tables is not working. I need to introduce extra logic to deal with PV shadow mode when using unique root page-tables. I don't think this should block the series, however; such missing functionality can always be added as follow-up work. paging_domctl() is adjusted to reflect this restriction. The main differences compared to v1 are the usage of per-vCPU root page-tables (as opposed to per-pCPU), and the usage of the existing perdomain family of functions to manage the mappings in the per-domain slot, which now becomes per-vCPU. All patches up to 17 are mostly preparatory; I think there's a nice cleanup and generalization of how per-domain mappings are created and managed, by no longer storing references to L1 page-tables in the vCPU or domain struct. Patch 13 introduces the command line option, and would need discussion and integration with the sparse direct map series. IMO we should get consensus on how we want the command line to look ASAP, so that we can get basic parsing logic in place to be used by both the work here and the direct map removal series. As part of this series the map_domain_page() helpers are also switched to create per-vCPU mappings (see patch 15), which converts an existing interface into creating per-vCPU mappings. Such an interface can be used to hide (map per-vCPU) further data that we don't want to be part of the direct map, or even shared between vCPUs of the same domain. Also, all existing users of the interface will already create per-vCPU mappings without needing additional changes. Note that none of the logic introduced in the series removes entries from the directmap, so even when creating the per-CPU mappings the underlying physical addresses remain fully accessible through their direct map entries. I also haven't done any benchmarking. The series doesn't seem to cripple performance to the point that XenRT jobs would time out before finishing; that's the only objective reference I can provide at the moment. The series has been extensively tested on XenRT, but that doesn't cover all possible use-cases, so it's likely to still have some rough edges; handle with care. Thanks, Roger. 
Roger Pau Monne (18): x86/mm: purge unneeded destroy_perdomain_mapping() x86/domain: limit window where curr_vcpu != current on context switch x86/mm: introduce helper to detect per-domain L1 entries that need freeing x86/pv: introduce function to populate perdomain area and use it to map Xen GDT x86/mm: switch destroy_perdomain_mapping() parameter from domain to vCPU x86/pv: set/clear guest GDT mappings using {populate,destroy}_perdomain_mapping() x86/pv: update guest LDT mappings using the linear entries x86/pv: remove stashing of GDT/LDT L1 page-tables x86/mm: simplify create_perdomain_mapping() interface x86/mm: switch {create,destroy}_perdomain_mapping() domain parameter to vCPU x86/pv: untie issuing FLUSH_ROOT_PGTBL from XPTI x86/mm: move FLUSH_ROOT_PGTBL handling before TLB flush x86/spec-ctrl: introduce Address Space Isolation command line option x86/mm: introduce per-vCPU L3 page-table x86/mm: introduce a per-vCPU mapcache when using ASI x86/pv: allow using a unique per-pCPU root page table (L4) x86/mm: switch to a per-CPU mapped stack when using ASI x86/mm: zero stack on context switch docs/misc/xen-command-line.pandoc | 24 +++ xen/arch/x86/cpu/mcheck/mce.c | 4 + xen/arch/x86/domain.c | 157 +++++++++++---- xen/arch/x86/domain_page.c | 105 ++++++---- xen/arch/x86/flushtlb.c | 28 ++- xen/arch/x86/hvm/hvm.c | 6 - xen/arch/x86/include/asm/config.h | 16 +- xen/arch/x86/include/asm/current.h | 58 +++++- xen/arch/x86/include/asm/desc.h | 6 +- xen/arch/x86/include/asm/domain.h | 50 +++-- xen/arch/x86/include/asm/flushtlb.h | 2 +- xen/arch/x86/include/asm/mm.h | 15 +- xen/arch/x86/include/asm/processor.h | 5 + xen/arch/x86/include/asm/pv/mm.h | 5 + xen/arch/x86/include/asm/smp.h | 12 ++ xen/arch/x86/include/asm/spec_ctrl.h | 4 + xen/arch/x86/mm.c | 291 +++++++++++++++++++++------ xen/arch/x86/mm/hap/hap.c | 2 +- xen/arch/x86/mm/paging.c | 6 + xen/arch/x86/mm/shadow/hvm.c | 2 +- xen/arch/x86/mm/shadow/multi.c | 2 +- xen/arch/x86/pv/descriptor-tables.c | 47 ++--- xen/arch/x86/pv/dom0_build.c | 12 +- xen/arch/x86/pv/domain.c | 57 ++++-- xen/arch/x86/pv/mm.c | 43 +++- xen/arch/x86/setup.c | 32 ++- xen/arch/x86/smp.c | 39 ++++ xen/arch/x86/smpboot.c | 26 ++- xen/arch/x86/spec_ctrl.c | 205 ++++++++++++++++++- xen/arch/x86/traps.c | 25 ++- xen/arch/x86/x86_64/mm.c | 7 +- xen/common/smp.c | 10 + xen/common/stop_machine.c | 10 + xen/include/xen/smp.h | 8 + 34 files changed, 1052 insertions(+), 269 deletions(-) -- 2.46.0
The destroy_perdomain_mapping() call in the hvm_domain_initialise() fail path is useless. destroy_perdomain_mapping() called with nr == 0 is effectively a no op, as there are not entries torn down. Remove the call, as arch_domain_create() already calls free_perdomain_mappings() on failure. There's also a call to destroy_perdomain_mapping() in pv_domain_destroy() which is also not needed. arch_domain_destroy() will already unconditionally call free_perdomain_mappings(), which does the same as destroy_perdomain_mapping(), plus additionally frees the page table structures. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/hvm/hvm.c | 1 - xen/arch/x86/pv/domain.c | 3 --- 2 files changed, 4 deletions(-) diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -XXX,XX +XXX,XX @@ int hvm_domain_initialise(struct domain *d, XFREE(d->arch.hvm.irq); fail0: hvm_destroy_cacheattr_region_list(d); - destroy_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0); fail: hvm_domain_relinquish_resources(d); XFREE(d->arch.hvm.io_handler); diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/domain.c +++ b/xen/arch/x86/pv/domain.c @@ -XXX,XX +XXX,XX @@ void pv_domain_destroy(struct domain *d) { pv_l1tf_domain_destroy(d); - destroy_perdomain_mapping(d, GDT_LDT_VIRT_START, - GDT_LDT_MBYTES << (20 - PAGE_SHIFT)); - XFREE(d->arch.pv.cpuidmasks); FREE_XENHEAP_PAGE(d->arch.pv.gdt_ldt_l1tab); -- 2.46.0
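A tiny sketch of why the nr == 0 case is a no-op: the per-entry loop in destroy_perdomain_mapping() has the shape below, so with nr == 0 the body never runs and no mappings are torn down. This is a simplified standalone model, not the Xen function itself.

#include <stdio.h>

int main(void)
{
    unsigned int nr = 0, i = 0, torn_down = 0;

    /* Same loop shape as destroy_perdomain_mapping()'s inner loop. */
    for ( ; nr && i < 512; --nr, ++i )
        torn_down++;            /* would zap one L1 entry per iteration */

    printf("entries torn down: %u\n", torn_down);   /* prints 0 */
    return 0;
}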
On x86 Xen will perform lazy context switches to the idle vCPU, where the previously running vCPU context is not overwritten, and only current is updated to point to the idle vCPU. The state is then disjunct between current and curr_vcpu: current points to the idle vCPU, while curr_vcpu points to the vCPU whose context is loaded on the pCPU. While on that lazy context switched state, certain calls (like map_domain_page()) will trigger a full synchronization of the pCPU state by forcing a context switch. Note however how calling any of such functions inside the context switch code itself is very likely to trigger an infinite recursion loop. Attempt to limit the window where curr_vcpu != current in the context switch code, as to prevent and infinite recursion loop around sync_local_execstate(). This is required for using map_domain_page() in the vCPU context switch code, otherwise using map_domain_page() in that context ends up in a recursive sync_local_execstate() loop: map_domain_page() -> sync_local_execstate() -> map_domain_page() -> ... Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- Changes since v1: - New in this version. --- xen/arch/x86/domain.c | 58 +++++++++++++++++++++++++++++++++++-------- xen/arch/x86/traps.c | 2 -- 2 files changed, 48 insertions(+), 12 deletions(-) diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -XXX,XX +XXX,XX @@ static void load_default_gdt(unsigned int cpu) per_cpu(full_gdt_loaded, cpu) = false; } -static void __context_switch(void) +static void __context_switch(struct vcpu *n) { struct cpu_user_regs *stack_regs = guest_cpu_user_regs(); unsigned int cpu = smp_processor_id(); struct vcpu *p = per_cpu(curr_vcpu, cpu); - struct vcpu *n = current; struct domain *pd = p->domain, *nd = n->domain; ASSERT(p != n); ASSERT(!vcpu_cpu_dirty(n)); + ASSERT(p == current); if ( !is_idle_domain(pd) ) { @@ -XXX,XX +XXX,XX @@ static void __context_switch(void) write_ptbase(n); + /* + * It's relevant to set both current and curr_vcpu back-to-back, to avoid a + * window where calls to mapcache_current_vcpu() during the context switch + * could trigger a recursive loop. + * + * Do the current switch immediately after switching to the new guest + * page-tables, so that current is (almost) always in sync with the + * currently loaded page-tables. + */ + set_current(n); + per_cpu(curr_vcpu, cpu) = n; + #ifdef CONFIG_PV /* Prefetch the VMCB if we expect to use it later in the context switch */ if ( using_svm() && is_pv_64bit_domain(nd) && !is_idle_domain(nd) ) @@ -XXX,XX +XXX,XX @@ static void __context_switch(void) if ( pd != nd ) cpumask_clear_cpu(cpu, pd->dirty_cpumask); write_atomic(&p->dirty_cpu, VCPU_CPU_CLEAN); - - per_cpu(curr_vcpu, cpu) = n; } void context_switch(struct vcpu *prev, struct vcpu *next) @@ -XXX,XX +XXX,XX @@ void context_switch(struct vcpu *prev, struct vcpu *next) local_irq_disable(); - set_current(next); - if ( (per_cpu(curr_vcpu, cpu) == next) || (is_idle_domain(nextd) && cpu_online(cpu)) ) { + /* + * Lazy context switch to the idle vCPU, set current == idle. Full + * context switch happens if/when sync_local_execstate() is called. + */ + set_current(next); local_irq_enable(); } else { - __context_switch(); + /* + * curr_vcpu will always point to the currently loaded vCPU context, as + * it's not updated when doing a lazy switch to the idle vCPU. 
+ */ + struct vcpu *prev_ctx = per_cpu(curr_vcpu, cpu); + + if ( prev_ctx != current ) + { + /* + * Doing a full context switch to a non-idle vCPU from a lazy + * context switched state. Adjust current to point to the + * currently loaded vCPU context. + */ + ASSERT(current == idle_vcpu[cpu]); + ASSERT(!is_idle_vcpu(next)); + set_current(prev_ctx); + } + __context_switch(next); /* Re-enable interrupts before restoring state which may fault. */ local_irq_enable(); @@ -XXX,XX +XXX,XX @@ int __sync_local_execstate(void) { unsigned long flags; int switch_required; + unsigned int cpu = smp_processor_id(); + struct vcpu *p; local_irq_save(flags); - switch_required = (this_cpu(curr_vcpu) != current); + p = per_cpu(curr_vcpu, cpu); + switch_required = (p != current); if ( switch_required ) { - ASSERT(current == idle_vcpu[smp_processor_id()]); - __context_switch(); + ASSERT(current == idle_vcpu[cpu]); + /* + * Restore current to the previously running vCPU, __context_switch() + * will update current together with curr_vcpu. + */ + set_current(p); + __context_switch(idle_vcpu[cpu]); } local_irq_restore(flags); diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/traps.c +++ b/xen/arch/x86/traps.c @@ -XXX,XX +XXX,XX @@ void __init trap_init(void) void activate_debugregs(const struct vcpu *curr) { - ASSERT(curr == current); - write_debugreg(0, curr->arch.dr[0]); write_debugreg(1, curr->arch.dr[1]); write_debugreg(2, curr->arch.dr[2]); -- 2.46.0
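A self-contained toy model (not Xen code) of the invariant this patch establishes: 'current' may lazily point at the idle vCPU while curr_vcpu still names the vCPU whose state is loaded, and the two are only ever realigned by a full switch that updates them back-to-back, so a sync triggered in the middle of a context switch cannot recurse.

#include <assert.h>
#include <stdio.h>

struct vcpu { const char *name; };

static struct vcpu idle = { "idle" };
static struct vcpu d0v0 = { "d0v0" };
static struct vcpu d1v0 = { "d1v0" };

static struct vcpu *current_v = &d0v0;   /* what 'current' points at */
static struct vcpu *curr_vcpu = &d0v0;   /* whose state is loaded on the pCPU */

/* Lazy switch: only 'current' moves to idle, the loaded state stays. */
static void lazy_switch_to_idle(void)
{
    current_v = &idle;
}

/* Full switch: load next's state, then update both pointers back-to-back. */
static void full_switch(struct vcpu *next)
{
    /* ... save curr_vcpu's state, switch page-tables to next ... */
    current_v = next;
    curr_vcpu = next;
}

/* Model of sync_local_execstate(): make a lazy switch to idle real. */
static void sync_local_execstate_model(void)
{
    if ( curr_vcpu != current_v )
    {
        assert(current_v == &idle);      /* only the lazy-idle case may differ */
        current_v = curr_vcpu;           /* realign before doing the real switch */
        full_switch(&idle);              /* complete the deferred switch to idle */
    }
}

int main(void)
{
    lazy_switch_to_idle();
    assert(current_v == &idle && curr_vcpu == &d0v0);

    sync_local_execstate_model();
    assert(current_v == &idle && curr_vcpu == &idle);

    full_switch(&d1v0);
    assert(current_v == &d1v0 && curr_vcpu == &d1v0);

    printf("running: %s\n", current_v->name);
    return 0;
}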
L1 present entries that require the underlying page to be freed have the _PAGE_AVAIL0 bit set, introduce a helper to unify the checking logic into a single place. No functional change intended. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/mm.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ void __iomem *__init ioremap_wc(paddr_t pa, size_t len) return (void __force __iomem *)(va + offs); } +static bool perdomain_l1e_needs_freeing(l1_pgentry_t l1e) +{ + return (l1e_get_flags(l1e) & (_PAGE_PRESENT | _PAGE_AVAIL0)) == + (_PAGE_PRESENT | _PAGE_AVAIL0); +} + int create_perdomain_mapping(struct domain *d, unsigned long va, unsigned int nr, l1_pgentry_t **pl1tab, struct page_info **ppg) @@ -XXX,XX +XXX,XX @@ void destroy_perdomain_mapping(struct domain *d, unsigned long va, for ( ; nr && i < L1_PAGETABLE_ENTRIES; --nr, ++i ) { - if ( (l1e_get_flags(l1tab[i]) & - (_PAGE_PRESENT | _PAGE_AVAIL0)) == - (_PAGE_PRESENT | _PAGE_AVAIL0) ) + if ( perdomain_l1e_needs_freeing(l1tab[i]) ) free_domheap_page(l1e_get_page(l1tab[i])); l1tab[i] = l1e_empty(); } @@ -XXX,XX +XXX,XX @@ void free_perdomain_mappings(struct domain *d) unsigned int k; for ( k = 0; k < L1_PAGETABLE_ENTRIES; ++k ) - if ( (l1e_get_flags(l1tab[k]) & - (_PAGE_PRESENT | _PAGE_AVAIL0)) == - (_PAGE_PRESENT | _PAGE_AVAIL0) ) + if ( perdomain_l1e_needs_freeing(l1tab[k]) ) free_domheap_page(l1e_get_page(l1tab[k])); unmap_domain_page(l1tab); -- 2.46.0
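A standalone illustration of the flag test the helper centralises: an L1 entry only owns its backing page (and hence needs that page freed) when both _PAGE_PRESENT and the software-available _PAGE_AVAIL0 marker are set. The bit positions below follow the usual x86 PTE layout but should be read as assumptions for the example rather than definitions copied from Xen headers.

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define _PAGE_PRESENT  (1U << 0)
#define _PAGE_RW       (1U << 1)
#define _PAGE_AVAIL0   (1U << 9)   /* first software-available PTE bit */

static bool perdomain_l1e_needs_freeing(uint64_t l1e_flags)
{
    return (l1e_flags & (_PAGE_PRESENT | _PAGE_AVAIL0)) ==
           (_PAGE_PRESENT | _PAGE_AVAIL0);
}

int main(void)
{
    /* Page allocated by the perdomain code itself: present + marker bit. */
    assert(perdomain_l1e_needs_freeing(_PAGE_PRESENT | _PAGE_RW | _PAGE_AVAIL0));

    /* Caller-provided mapping: present but unmarked, must not be freed. */
    assert(!perdomain_l1e_needs_freeing(_PAGE_PRESENT | _PAGE_RW));

    /* Non-present entry: nothing to free even if the marker bit lingers. */
    assert(!perdomain_l1e_needs_freeing(_PAGE_AVAIL0));

    return 0;
}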
The current code to update the Xen part of the GDT when running a PV guest relies on caching the direct map address of all the L1 tables used to map the GDT and LDT, so that entries can be modified. Introduce a new function that populates the per-domain region, either using the recursive linear mappings when the target vCPU is the current one, or by directly modifying the L1 table of the per-domain region. Using such function to populate per-domain addresses drops the need to keep a reference to per-domain L1 tables previously used to change the per-domain mappings. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/domain.c | 11 +++- xen/arch/x86/include/asm/desc.h | 6 +- xen/arch/x86/include/asm/mm.h | 2 + xen/arch/x86/include/asm/processor.h | 5 ++ xen/arch/x86/mm.c | 88 ++++++++++++++++++++++++++++ xen/arch/x86/smpboot.c | 6 +- xen/arch/x86/traps.c | 10 ++-- 7 files changed, 113 insertions(+), 15 deletions(-) diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -XXX,XX +XXX,XX @@ static always_inline bool need_full_gdt(const struct domain *d) static void update_xen_slot_in_full_gdt(const struct vcpu *v, unsigned int cpu) { - l1e_write(pv_gdt_ptes(v) + FIRST_RESERVED_GDT_PAGE, - !is_pv_32bit_vcpu(v) ? per_cpu(gdt_l1e, cpu) - : per_cpu(compat_gdt_l1e, cpu)); + ASSERT(v != current); + + populate_perdomain_mapping(v, + GDT_VIRT_START(v) + + (FIRST_RESERVED_GDT_PAGE << PAGE_SHIFT), + !is_pv_32bit_vcpu(v) ? &per_cpu(gdt_mfn, cpu) + : &per_cpu(compat_gdt_mfn, + cpu), 1); } static void load_full_gdt(const struct vcpu *v, unsigned int cpu) diff --git a/xen/arch/x86/include/asm/desc.h b/xen/arch/x86/include/asm/desc.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/desc.h +++ b/xen/arch/x86/include/asm/desc.h @@ -XXX,XX +XXX,XX @@ #ifndef __ASSEMBLY__ +#include <xen/mm-frame.h> + #define GUEST_KERNEL_RPL(d) (is_pv_32bit_domain(d) ? 1 : 3) /* Fix up the RPL of a guest segment selector. 
*/ @@ -XXX,XX +XXX,XX @@ struct __packed desc_ptr { extern seg_desc_t boot_gdt[]; DECLARE_PER_CPU(seg_desc_t *, gdt); -DECLARE_PER_CPU(l1_pgentry_t, gdt_l1e); +DECLARE_PER_CPU(mfn_t, gdt_mfn); extern seg_desc_t boot_compat_gdt[]; DECLARE_PER_CPU(seg_desc_t *, compat_gdt); -DECLARE_PER_CPU(l1_pgentry_t, compat_gdt_l1e); +DECLARE_PER_CPU(mfn_t, compat_gdt_mfn); DECLARE_PER_CPU(bool, full_gdt_loaded); static inline void lgdt(const struct desc_ptr *gdtr) diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/mm.h +++ b/xen/arch/x86/include/asm/mm.h @@ -XXX,XX +XXX,XX @@ int compat_arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg); int create_perdomain_mapping(struct domain *d, unsigned long va, unsigned int nr, l1_pgentry_t **pl1tab, struct page_info **ppg); +void populate_perdomain_mapping(const struct vcpu *v, unsigned long va, + mfn_t *mfn, unsigned long nr); void destroy_perdomain_mapping(struct domain *d, unsigned long va, unsigned int nr); void free_perdomain_mappings(struct domain *d); diff --git a/xen/arch/x86/include/asm/processor.h b/xen/arch/x86/include/asm/processor.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/processor.h +++ b/xen/arch/x86/include/asm/processor.h @@ -XXX,XX +XXX,XX @@ static inline unsigned long cr3_pa(unsigned long cr3) return cr3 & X86_CR3_ADDR_MASK; } +static inline mfn_t cr3_mfn(unsigned long cr3) +{ + return maddr_to_mfn(cr3_pa(cr3)); +} + static inline unsigned int cr3_pcid(unsigned long cr3) { return IS_ENABLED(CONFIG_PV) ? cr3 & X86_CR3_PCID_MASK : 0; diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ int create_perdomain_mapping(struct domain *d, unsigned long va, return rc; } +void populate_perdomain_mapping(const struct vcpu *v, unsigned long va, + mfn_t *mfn, unsigned long nr) +{ + l1_pgentry_t *l1tab = NULL, *pl1e; + const l3_pgentry_t *l3tab; + const l2_pgentry_t *l2tab; + struct domain *d = v->domain; + + ASSERT(va >= PERDOMAIN_VIRT_START && + va < PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS)); + ASSERT(!nr || !l3_table_offset(va ^ (va + nr * PAGE_SIZE - 1))); + + /* Use likely to force the optimization for the fast path. */ + if ( likely(v == current) ) + { + unsigned int i; + + /* Ensure page-tables are from current (if current != curr_vcpu). */ + sync_local_execstate(); + + /* Fast path: get L1 entries using the recursive linear mappings. 
*/ + pl1e = &__linear_l1_table[l1_linear_offset(va)]; + + for ( i = 0; i < nr; i++, pl1e++ ) + { + if ( unlikely(perdomain_l1e_needs_freeing(*pl1e)) ) + { + ASSERT_UNREACHABLE(); + free_domheap_page(l1e_get_page(*pl1e)); + } + l1e_write(pl1e, l1e_from_mfn(mfn[i], __PAGE_HYPERVISOR_RW)); + } + + return; + } + + ASSERT(d->arch.perdomain_l3_pg); + l3tab = __map_domain_page(d->arch.perdomain_l3_pg); + + if ( unlikely(!(l3e_get_flags(l3tab[l3_table_offset(va)]) & + _PAGE_PRESENT)) ) + { + unmap_domain_page(l3tab); + gprintk(XENLOG_ERR, "unable to map at VA %lx: L3e not present\n", va); + ASSERT_UNREACHABLE(); + domain_crash(d); + + return; + } + + l2tab = map_l2t_from_l3e(l3tab[l3_table_offset(va)]); + + for ( ; nr--; va += PAGE_SIZE, mfn++ ) + { + if ( !l1tab || !l1_table_offset(va) ) + { + const l2_pgentry_t *pl2e = l2tab + l2_table_offset(va); + + if ( unlikely(!(l2e_get_flags(*pl2e) & _PAGE_PRESENT)) ) + { + gprintk(XENLOG_ERR, "unable to map at VA %lx: L2e not present\n", + va); + ASSERT_UNREACHABLE(); + domain_crash(d); + + break; + } + + unmap_domain_page(l1tab); + l1tab = map_l1t_from_l2e(*pl2e); + } + + pl1e = &l1tab[l1_table_offset(va)]; + + if ( unlikely(perdomain_l1e_needs_freeing(*pl1e)) ) + { + ASSERT_UNREACHABLE(); + free_domheap_page(l1e_get_page(*pl1e)); + } + + l1e_write(pl1e, l1e_from_mfn(*mfn, __PAGE_HYPERVISOR_RW)); + } + + unmap_domain_page(l1tab); + unmap_domain_page(l2tab); + unmap_domain_page(l3tab); +} + void destroy_perdomain_mapping(struct domain *d, unsigned long va, unsigned int nr) { diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/smpboot.c +++ b/xen/arch/x86/smpboot.c @@ -XXX,XX +XXX,XX @@ static int cpu_smpboot_alloc(unsigned int cpu) if ( gdt == NULL ) goto out; per_cpu(gdt, cpu) = gdt; - per_cpu(gdt_l1e, cpu) = - l1e_from_pfn(virt_to_mfn(gdt), __PAGE_HYPERVISOR_RW); + per_cpu(gdt_mfn, cpu) = _mfn(virt_to_mfn(gdt)); memcpy(gdt, boot_gdt, NR_RESERVED_GDT_PAGES * PAGE_SIZE); BUILD_BUG_ON(NR_CPUS > 0x10000); gdt[PER_CPU_GDT_ENTRY - FIRST_RESERVED_GDT_ENTRY].a = cpu; @@ -XXX,XX +XXX,XX @@ static int cpu_smpboot_alloc(unsigned int cpu) per_cpu(compat_gdt, cpu) = gdt = alloc_xenheap_pages(0, memflags); if ( gdt == NULL ) goto out; - per_cpu(compat_gdt_l1e, cpu) = - l1e_from_pfn(virt_to_mfn(gdt), __PAGE_HYPERVISOR_RW); + per_cpu(compat_gdt_mfn, cpu) = _mfn(virt_to_mfn(gdt)); memcpy(gdt, boot_compat_gdt, NR_RESERVED_GDT_PAGES * PAGE_SIZE); gdt[PER_CPU_GDT_ENTRY - FIRST_RESERVED_GDT_ENTRY].a = cpu; #endif diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/traps.c +++ b/xen/arch/x86/traps.c @@ -XXX,XX +XXX,XX @@ DEFINE_PER_CPU(uint64_t, efer); static DEFINE_PER_CPU(unsigned long, last_extable_addr); DEFINE_PER_CPU_READ_MOSTLY(seg_desc_t *, gdt); -DEFINE_PER_CPU_READ_MOSTLY(l1_pgentry_t, gdt_l1e); +DEFINE_PER_CPU_READ_MOSTLY(mfn_t, gdt_mfn); #ifdef CONFIG_PV32 DEFINE_PER_CPU_READ_MOSTLY(seg_desc_t *, compat_gdt); -DEFINE_PER_CPU_READ_MOSTLY(l1_pgentry_t, compat_gdt_l1e); +DEFINE_PER_CPU_READ_MOSTLY(mfn_t, compat_gdt_mfn); #endif /* Master table, used by CPU0. */ @@ -XXX,XX +XXX,XX @@ void __init trap_init(void) init_ler(); /* Cache {,compat_}gdt_l1e now that physically relocation is done. 
*/ - this_cpu(gdt_l1e) = - l1e_from_pfn(virt_to_mfn(boot_gdt), __PAGE_HYPERVISOR_RW); + this_cpu(gdt_mfn) = _mfn(virt_to_mfn(boot_gdt)); if ( IS_ENABLED(CONFIG_PV32) ) - this_cpu(compat_gdt_l1e) = - l1e_from_pfn(virt_to_mfn(boot_compat_gdt), __PAGE_HYPERVISOR_RW); + this_cpu(compat_gdt_mfn) = _mfn(virt_to_mfn(boot_compat_gdt)); percpu_traps_init(); -- 2.46.0
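A hypothetical caller, sketched in in-tree style (it is not standalone, and the my_*-prefixed names are invented), showing how populate_perdomain_mapping() is meant to be used: collect the MFNs to expose and install them with a single call covering the whole range.

static void map_my_percpu_frames(const struct vcpu *v, unsigned long my_area_va,
                                 struct page_info **pages, unsigned int nr)
{
    mfn_t mfns[8];
    unsigned int i;

    ASSERT(nr <= ARRAY_SIZE(mfns));

    for ( i = 0; i < nr; i++ )
        mfns[i] = page_to_mfn(pages[i]);

    /*
     * One call installs all nr frames.  When v == current the L1 entries are
     * edited through the recursive linear mappings; otherwise the per-domain
     * L3/L2/L1 tables are walked via map_domain_page().
     */
    populate_perdomain_mapping(v, my_area_va, mfns, nr);
}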
In preparation for the per-domain area being populated with per-vCPU mappings change the parameter of destroy_perdomain_mapping() to be a vCPU instead of a domain, and also update the function logic to allow manipulation of per-domain mappings using the linear page table mappings. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/include/asm/mm.h | 2 +- xen/arch/x86/mm.c | 24 +++++++++++++++++++++++- xen/arch/x86/pv/domain.c | 3 +-- xen/arch/x86/x86_64/mm.c | 2 +- 4 files changed, 26 insertions(+), 5 deletions(-) diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/mm.h +++ b/xen/arch/x86/include/asm/mm.h @@ -XXX,XX +XXX,XX @@ int create_perdomain_mapping(struct domain *d, unsigned long va, struct page_info **ppg); void populate_perdomain_mapping(const struct vcpu *v, unsigned long va, mfn_t *mfn, unsigned long nr); -void destroy_perdomain_mapping(struct domain *d, unsigned long va, +void destroy_perdomain_mapping(const struct vcpu *v, unsigned long va, unsigned int nr); void free_perdomain_mappings(struct domain *d); diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ void populate_perdomain_mapping(const struct vcpu *v, unsigned long va, unmap_domain_page(l3tab); } -void destroy_perdomain_mapping(struct domain *d, unsigned long va, +void destroy_perdomain_mapping(const struct vcpu *v, unsigned long va, unsigned int nr) { const l3_pgentry_t *l3tab, *pl3e; + const struct domain *d = v->domain; ASSERT(va >= PERDOMAIN_VIRT_START && va < PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS)); @@ -XXX,XX +XXX,XX @@ void destroy_perdomain_mapping(struct domain *d, unsigned long va, if ( !d->arch.perdomain_l3_pg ) return; + /* Use likely to force the optimization for the fast path. */ + if ( likely(v == current) ) + { + l1_pgentry_t *pl1e; + + /* Ensure page-tables are from current (if current != curr_vcpu). */ + sync_local_execstate(); + + pl1e = &__linear_l1_table[l1_linear_offset(va)]; + + /* Fast path: zap L1 entries using the recursive linear mappings. */ + for ( ; nr--; pl1e++ ) + { + if ( perdomain_l1e_needs_freeing(*pl1e) ) + free_domheap_page(l1e_get_page(*pl1e)); + l1e_write(pl1e, l1e_empty()); + } + + return; + } + l3tab = __map_domain_page(d->arch.perdomain_l3_pg); pl3e = l3tab + l3_table_offset(va); diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/domain.c +++ b/xen/arch/x86/pv/domain.c @@ -XXX,XX +XXX,XX @@ static int pv_create_gdt_ldt_l1tab(struct vcpu *v) static void pv_destroy_gdt_ldt_l1tab(struct vcpu *v) { - destroy_perdomain_mapping(v->domain, GDT_VIRT_START(v), - 1U << GDT_LDT_VCPU_SHIFT); + destroy_perdomain_mapping(v, GDT_VIRT_START(v), 1U << GDT_LDT_VCPU_SHIFT); } void pv_vcpu_destroy(struct vcpu *v) diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/x86_64/mm.c +++ b/xen/arch/x86/x86_64/mm.c @@ -XXX,XX +XXX,XX @@ int setup_compat_arg_xlat(struct vcpu *v) void free_compat_arg_xlat(struct vcpu *v) { - destroy_perdomain_mapping(v->domain, ARG_XLAT_START(v), + destroy_perdomain_mapping(v, ARG_XLAT_START(v), PFN_UP(COMPAT_ARG_XLAT_SIZE)); } -- 2.46.0
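A sketch (again in-tree style, with invented my_* names) of pairing the two calls for a transient mapping in the current vCPU's per-domain area; since the vCPU passed is assumed to be the one running, both calls take the new fast path that edits L1 entries through the recursive linear page-tables.

static void use_frame_transiently(struct vcpu *curr /* == current */, mfn_t mfn,
                                  unsigned long my_transient_va)
{
    populate_perdomain_mapping(curr, my_transient_va, &mfn, 1);

    /* ... access the frame through my_transient_va ... */

    /* Entries installed above carry no _PAGE_AVAIL0, so this only zaps them. */
    destroy_perdomain_mapping(curr, my_transient_va, 1);
}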
The pv_{set,destroy}_gdt() functions rely on the L1 table(s) that contain such mappings being stashed in the domain structure, and thus such mappings being modified by merely updating the L1 entries. Switch both pv_{set,destroy}_gdt() to instead use {populate,destory}_perdomain_mapping(). Note that this requires moving the pv_set_gdt() call in arch_set_info_guest() strictly after update_cr3(), so v->arch.cr3 is valid when populate_perdomain_mapping() is called. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/domain.c | 33 ++++++++++++++--------------- xen/arch/x86/pv/descriptor-tables.c | 28 +++++++++++------------- 2 files changed, 28 insertions(+), 33 deletions(-) diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -XXX,XX +XXX,XX @@ int arch_set_info_guest( if ( rc ) return rc; - if ( !compat ) - rc = pv_set_gdt(v, c.nat->gdt_frames, c.nat->gdt_ents); -#ifdef CONFIG_COMPAT - else - { - unsigned long gdt_frames[ARRAY_SIZE(v->arch.pv.gdt_frames)]; - - for ( i = 0; i < nr_gdt_frames; ++i ) - gdt_frames[i] = c.cmp->gdt_frames[i]; - - rc = pv_set_gdt(v, gdt_frames, c.cmp->gdt_ents); - } -#endif - if ( rc != 0 ) - return rc; - set_bit(_VPF_in_reset, &v->pause_flags); #ifdef CONFIG_COMPAT @@ -XXX,XX +XXX,XX @@ int arch_set_info_guest( { if ( cr3_page ) put_page(cr3_page); - pv_destroy_gdt(v); return rc; } @@ -XXX,XX +XXX,XX @@ int arch_set_info_guest( paging_update_paging_modes(v); else update_cr3(v); + + if ( !compat ) + rc = pv_set_gdt(v, c.nat->gdt_frames, c.nat->gdt_ents); +#ifdef CONFIG_COMPAT + else + { + unsigned long gdt_frames[ARRAY_SIZE(v->arch.pv.gdt_frames)]; + + for ( i = 0; i < nr_gdt_frames; ++i ) + gdt_frames[i] = c.cmp->gdt_frames[i]; + + rc = pv_set_gdt(v, gdt_frames, c.cmp->gdt_ents); + } +#endif + if ( rc != 0 ) + return rc; #endif /* CONFIG_PV */ out: diff --git a/xen/arch/x86/pv/descriptor-tables.c b/xen/arch/x86/pv/descriptor-tables.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/descriptor-tables.c +++ b/xen/arch/x86/pv/descriptor-tables.c @@ -XXX,XX +XXX,XX @@ bool pv_destroy_ldt(struct vcpu *v) void pv_destroy_gdt(struct vcpu *v) { - l1_pgentry_t *pl1e = pv_gdt_ptes(v); - mfn_t zero_mfn = _mfn(virt_to_mfn(zero_page)); - l1_pgentry_t zero_l1e = l1e_from_mfn(zero_mfn, __PAGE_HYPERVISOR_RO); unsigned int i; ASSERT(v == current || !vcpu_cpu_dirty(v)); - v->arch.pv.gdt_ents = 0; - for ( i = 0; i < FIRST_RESERVED_GDT_PAGE; i++ ) - { - mfn_t mfn = l1e_get_mfn(pl1e[i]); + if ( v->arch.cr3 ) + destroy_perdomain_mapping(v, GDT_VIRT_START(v), + ARRAY_SIZE(v->arch.pv.gdt_frames)); - if ( (l1e_get_flags(pl1e[i]) & _PAGE_PRESENT) && - !mfn_eq(mfn, zero_mfn) ) - put_page_and_type(mfn_to_page(mfn)); + for ( i = 0; i < ARRAY_SIZE(v->arch.pv.gdt_frames); i++) + { + if ( !v->arch.pv.gdt_frames[i] ) + break; - l1e_write(&pl1e[i], zero_l1e); + put_page_and_type(mfn_to_page(_mfn(v->arch.pv.gdt_frames[i]))); v->arch.pv.gdt_frames[i] = 0; } } @@ -XXX,XX +XXX,XX @@ int pv_set_gdt(struct vcpu *v, const unsigned long frames[], unsigned int entries) { struct domain *d = v->domain; - l1_pgentry_t *pl1e; unsigned int i, nr_frames = DIV_ROUND_UP(entries, 512); + mfn_t mfns[ARRAY_SIZE(v->arch.pv.gdt_frames)]; ASSERT(v == current || !vcpu_cpu_dirty(v)); @@ -XXX,XX +XXX,XX @@ int pv_set_gdt(struct vcpu *v, const unsigned long frames[], if ( !mfn_valid(mfn) || !get_page_and_type(mfn_to_page(mfn), d, PGT_seg_desc_page) ) goto fail; + + mfns[i] = mfn; } /* Tear down the old GDT. 
*/ @@ -XXX,XX +XXX,XX @@ int pv_set_gdt(struct vcpu *v, const unsigned long frames[], /* Install the new GDT. */ v->arch.pv.gdt_ents = entries; - pl1e = pv_gdt_ptes(v); for ( i = 0; i < nr_frames; i++ ) - { v->arch.pv.gdt_frames[i] = frames[i]; - l1e_write(&pl1e[i], l1e_from_pfn(frames[i], __PAGE_HYPERVISOR_RW)); - } + populate_perdomain_mapping(v, GDT_VIRT_START(v), mfns, nr_frames); return 0; -- 2.46.0
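A standalone illustration of the sizing used by pv_set_gdt(): each 4 KiB GDT frame holds 512 eight-byte descriptors, so nr_frames is DIV_ROUND_UP(entries, 512), bounded by the size of the gdt_frames[] array (taken to be 14 here as an assumption for the example).

#include <assert.h>
#include <stdio.h>

#define DESCS_PER_FRAME           512U
#define FIRST_RESERVED_GDT_PAGE   14U   /* assumed bound on guest GDT frames */

static unsigned int gdt_nr_frames(unsigned int entries)
{
    return (entries + DESCS_PER_FRAME - 1) / DESCS_PER_FRAME;
}

int main(void)
{
    assert(gdt_nr_frames(1) == 1);
    assert(gdt_nr_frames(512) == 1);
    assert(gdt_nr_frames(513) == 2);

    /* A maximal guest GDT still fits in the reserved number of frames. */
    assert(gdt_nr_frames(FIRST_RESERVED_GDT_PAGE * DESCS_PER_FRAME) ==
           FIRST_RESERVED_GDT_PAGE);

    printf("frames for 4096 entries: %u\n", gdt_nr_frames(4096));
    return 0;
}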
The pv_map_ldt_shadow_page() and pv_destroy_ldt() functions rely on the L1 table(s) that contain such mappings being stashed in the domain structure, and thus such mappings being modified by merely updating the require L1 entries. Switch pv_map_ldt_shadow_page() to unconditionally use the linear recursive, as that logic is always called while the vCPU is running on the current pCPU. For pv_destroy_ldt() use the linear mappings if the vCPU is the one currently running on the pCPU, otherwise use destroy_mappings(). Note this requires keeping an array with the pages currently mapped at the LDT area, as that allows dropping the extra taken page reference when removing the mappings. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/include/asm/domain.h | 2 ++ xen/arch/x86/pv/descriptor-tables.c | 19 ++++++++++--------- xen/arch/x86/pv/domain.c | 4 ++++ xen/arch/x86/pv/mm.c | 3 ++- 4 files changed, 18 insertions(+), 10 deletions(-) diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/domain.h +++ b/xen/arch/x86/include/asm/domain.h @@ -XXX,XX +XXX,XX @@ struct pv_vcpu struct trap_info *trap_ctxt; unsigned long gdt_frames[FIRST_RESERVED_GDT_PAGE]; + /* Max LDT entries is 8192, so 8192 * 8 = 64KiB (16 pages). */ + mfn_t ldt_frames[16]; unsigned long ldt_base; unsigned int gdt_ents, ldt_ents; diff --git a/xen/arch/x86/pv/descriptor-tables.c b/xen/arch/x86/pv/descriptor-tables.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/descriptor-tables.c +++ b/xen/arch/x86/pv/descriptor-tables.c @@ -XXX,XX +XXX,XX @@ */ bool pv_destroy_ldt(struct vcpu *v) { - l1_pgentry_t *pl1e; + const unsigned int nr_frames = ARRAY_SIZE(v->arch.pv.ldt_frames); unsigned int i, mappings_dropped = 0; - struct page_info *page; ASSERT(!in_irq()); ASSERT(v == current || !vcpu_cpu_dirty(v)); - pl1e = pv_ldt_ptes(v); + destroy_perdomain_mapping(v, LDT_VIRT_START(v), nr_frames); - for ( i = 0; i < 16; i++ ) + for ( i = 0; i < nr_frames; i++ ) { - if ( !(l1e_get_flags(pl1e[i]) & _PAGE_PRESENT) ) - continue; + mfn_t mfn = v->arch.pv.ldt_frames[i]; + struct page_info *page; - page = l1e_get_page(pl1e[i]); - l1e_write(&pl1e[i], l1e_empty()); - mappings_dropped++; + if ( mfn_eq(mfn, INVALID_MFN) ) + continue; + v->arch.pv.ldt_frames[i] = INVALID_MFN; + page = mfn_to_page(mfn); ASSERT_PAGE_IS_TYPE(page, PGT_seg_desc_page); ASSERT_PAGE_IS_DOMAIN(page, v->domain); put_page_and_type(page); + mappings_dropped++; } return mappings_dropped; diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/domain.c +++ b/xen/arch/x86/pv/domain.c @@ -XXX,XX +XXX,XX @@ void pv_vcpu_destroy(struct vcpu *v) int pv_vcpu_initialise(struct vcpu *v) { struct domain *d = v->domain; + unsigned int i; int rc; ASSERT(!is_idle_domain(d)); @@ -XXX,XX +XXX,XX @@ int pv_vcpu_initialise(struct vcpu *v) if ( rc ) return rc; + for ( i = 0; i < ARRAY_SIZE(v->arch.pv.ldt_frames); i++ ) + v->arch.pv.ldt_frames[i] = INVALID_MFN; + BUILD_BUG_ON(X86_NR_VECTORS * sizeof(*v->arch.pv.trap_ctxt) > PAGE_SIZE); v->arch.pv.trap_ctxt = xzalloc_array(struct trap_info, X86_NR_VECTORS); diff --git a/xen/arch/x86/pv/mm.c b/xen/arch/x86/pv/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/mm.c +++ b/xen/arch/x86/pv/mm.c @@ -XXX,XX +XXX,XX @@ bool pv_map_ldt_shadow_page(unsigned int offset) return false; } - pl1e = &pv_ldt_ptes(curr)[offset >> PAGE_SHIFT]; + curr->arch.pv.ldt_frames[offset >> PAGE_SHIFT] = 
page_to_mfn(page); + pl1e = &__linear_l1_table[l1_linear_offset(LDT_VIRT_START(curr) + offset)]; l1e_add_flags(gl1e, _PAGE_RW); l1e_write(pl1e, gl1e); -- 2.46.0
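A toy model (not Xen code) of the new ldt_frames[] bookkeeping: 8192 LDT entries of 8 bytes each give 64 KiB, i.e. 16 possible frames, and slots that were never demand-mapped stay at an INVALID_MFN-style sentinel, so teardown knows exactly which frames carry a page reference to drop.

#include <assert.h>
#include <stdint.h>

#define LDT_ENTRIES      8192U
#define DESC_SIZE        8U
#define PAGE_SIZE        4096U
#define LDT_FRAMES       (LDT_ENTRIES * DESC_SIZE / PAGE_SIZE)   /* 16 */
#define INVALID_MFN_VAL  (~0UL)

static uint64_t ldt_frames[LDT_FRAMES];

static void ldt_init(void)
{
    for ( unsigned int i = 0; i < LDT_FRAMES; i++ )
        ldt_frames[i] = INVALID_MFN_VAL;             /* nothing mapped yet */
}

/* Model of the demand-map path: remember which mfn backs a given offset. */
static void ldt_map(unsigned int offset, uint64_t mfn)
{
    ldt_frames[offset / PAGE_SIZE] = mfn;
}

/* Model of pv_destroy_ldt(): count (and forget) the mapped frames. */
static unsigned int ldt_teardown(void)
{
    unsigned int dropped = 0;

    for ( unsigned int i = 0; i < LDT_FRAMES; i++ )
    {
        if ( ldt_frames[i] == INVALID_MFN_VAL )
            continue;
        ldt_frames[i] = INVALID_MFN_VAL;
        dropped++;                      /* real code: put_page_and_type() */
    }

    return dropped;
}

int main(void)
{
    ldt_init();
    ldt_map(0x0000, 0x1234);            /* first LDT page faulted in */
    ldt_map(0x2000, 0x5678);            /* third LDT page faulted in */
    assert(LDT_FRAMES == 16);
    assert(ldt_teardown() == 2);
    assert(ldt_teardown() == 0);        /* idempotent once everything is gone */
    return 0;
}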
There are no remaining callers of pv_gdt_ptes() or pv_ldt_ptes() that use the stashed L1 page-tables in the domain structure. As such, the helpers and the fields can now be removed. No functional change intended, as the removed logic is not used. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/include/asm/domain.h | 9 --------- xen/arch/x86/pv/domain.c | 10 +--------- 2 files changed, 1 insertion(+), 18 deletions(-) diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/domain.h +++ b/xen/arch/x86/include/asm/domain.h @@ -XXX,XX +XXX,XX @@ struct time_scale { struct pv_domain { - l1_pgentry_t **gdt_ldt_l1tab; - atomic_t nr_l4_pages; /* Is a 32-bit PV guest? */ @@ -XXX,XX +XXX,XX @@ struct arch_domain #define has_pirq(d) (!!((d)->arch.emulation_flags & X86_EMU_USE_PIRQ)) #define has_vpci(d) (!!((d)->arch.emulation_flags & X86_EMU_VPCI)) -#define gdt_ldt_pt_idx(v) \ - ((v)->vcpu_id >> (PAGETABLE_ORDER - GDT_LDT_VCPU_SHIFT)) -#define pv_gdt_ptes(v) \ - ((v)->domain->arch.pv.gdt_ldt_l1tab[gdt_ldt_pt_idx(v)] + \ - (((v)->vcpu_id << GDT_LDT_VCPU_SHIFT) & (L1_PAGETABLE_ENTRIES - 1))) -#define pv_ldt_ptes(v) (pv_gdt_ptes(v) + 16) - struct pv_vcpu { /* map_domain_page() mapping cache. */ diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/domain.c +++ b/xen/arch/x86/pv/domain.c @@ -XXX,XX +XXX,XX @@ static int pv_create_gdt_ldt_l1tab(struct vcpu *v) { return create_perdomain_mapping(v->domain, GDT_VIRT_START(v), 1U << GDT_LDT_VCPU_SHIFT, - v->domain->arch.pv.gdt_ldt_l1tab, + NIL(l1_pgentry_t *), NULL); } @@ -XXX,XX +XXX,XX @@ void pv_domain_destroy(struct domain *d) pv_l1tf_domain_destroy(d); XFREE(d->arch.pv.cpuidmasks); - - FREE_XENHEAP_PAGE(d->arch.pv.gdt_ldt_l1tab); } void noreturn cf_check continue_pv_domain(void); @@ -XXX,XX +XXX,XX @@ int pv_domain_initialise(struct domain *d) pv_l1tf_domain_init(d); - d->arch.pv.gdt_ldt_l1tab = - alloc_xenheap_pages(0, MEMF_node(domain_to_node(d))); - if ( !d->arch.pv.gdt_ldt_l1tab ) - goto fail; - clear_page(d->arch.pv.gdt_ldt_l1tab); - if ( levelling_caps & ~LCAP_faulting && (d->arch.pv.cpuidmasks = xmemdup(&cpuidmask_defaults)) == NULL ) goto fail; -- 2.46.0
There are no longer any callers of create_perdomain_mapping() that request a reference to the used L1 tables, and hence the only difference between them is whether the caller wants the region to be populated, or just the paging structures to be allocated. Simplify the arguments to create_perdomain_mapping() to reflect the current usages: drop the last two arguments and instead introduce a boolean to signal whether the caller wants the region populated. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/domain_page.c | 10 ++++---- xen/arch/x86/hvm/hvm.c | 2 +- xen/arch/x86/include/asm/mm.h | 3 +-- xen/arch/x86/mm.c | 43 +++++++---------------------------- xen/arch/x86/pv/domain.c | 4 +--- xen/arch/x86/x86_64/mm.c | 3 +-- 6 files changed, 16 insertions(+), 49 deletions(-) diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/domain_page.c +++ b/xen/arch/x86/domain_page.c @@ -XXX,XX +XXX,XX @@ int mapcache_domain_init(struct domain *d) spin_lock_init(&dcache->lock); return create_perdomain_mapping(d, (unsigned long)dcache->inuse, - 2 * bitmap_pages + 1, - NIL(l1_pgentry_t *), NULL); + 2 * bitmap_pages + 1, false); } int mapcache_vcpu_init(struct vcpu *v) @@ -XXX,XX +XXX,XX @@ int mapcache_vcpu_init(struct vcpu *v) if ( ents > dcache->entries ) { /* Populate page tables. */ - int rc = create_perdomain_mapping(d, MAPCACHE_VIRT_START, ents, - NIL(l1_pgentry_t *), NULL); + int rc = create_perdomain_mapping(d, MAPCACHE_VIRT_START, ents, false); /* Populate bit maps. */ if ( !rc ) rc = create_perdomain_mapping(d, (unsigned long)dcache->inuse, - nr, NULL, NIL(struct page_info *)); + nr, true); if ( !rc ) rc = create_perdomain_mapping(d, (unsigned long)dcache->garbage, - nr, NULL, NIL(struct page_info *)); + nr, true); if ( rc ) return rc; diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -XXX,XX +XXX,XX @@ int hvm_domain_initialise(struct domain *d, INIT_LIST_HEAD(&d->arch.hvm.mmcfg_regions); INIT_LIST_HEAD(&d->arch.hvm.msix_tables); - rc = create_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0, NULL, NULL); + rc = create_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0, false); if ( rc ) goto fail; diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/mm.h +++ b/xen/arch/x86/include/asm/mm.h @@ -XXX,XX +XXX,XX @@ int compat_arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg); #define IS_NIL(ptr) (!((uintptr_t)(ptr) + sizeof(*(ptr)))) int create_perdomain_mapping(struct domain *d, unsigned long va, - unsigned int nr, l1_pgentry_t **pl1tab, - struct page_info **ppg); + unsigned int nr, bool populate); void populate_perdomain_mapping(const struct vcpu *v, unsigned long va, mfn_t *mfn, unsigned long nr); void destroy_perdomain_mapping(const struct vcpu *v, unsigned long va, diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ static bool perdomain_l1e_needs_freeing(l1_pgentry_t l1e) } int create_perdomain_mapping(struct domain *d, unsigned long va, - unsigned int nr, l1_pgentry_t **pl1tab, - struct page_info **ppg) + unsigned int nr, bool populate) { struct page_info *pg; l3_pgentry_t *l3tab; @@ -XXX,XX +XXX,XX @@ int create_perdomain_mapping(struct domain *d, unsigned long va, unmap_domain_page(l3tab); - if ( !pl1tab && !ppg ) - { - 
unmap_domain_page(l2tab); - return 0; - } - for ( l1tab = NULL; !rc && nr--; ) { l2_pgentry_t *pl2e = l2tab + l2_table_offset(va); if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) ) { - if ( pl1tab && !IS_NIL(pl1tab) ) - { - l1tab = alloc_xenheap_pages(0, MEMF_node(domain_to_node(d))); - if ( !l1tab ) - { - rc = -ENOMEM; - break; - } - ASSERT(!pl1tab[l2_table_offset(va)]); - pl1tab[l2_table_offset(va)] = l1tab; - pg = virt_to_page(l1tab); - } - else + pg = alloc_domheap_page(d, MEMF_no_owner); + if ( !pg ) { - pg = alloc_domheap_page(d, MEMF_no_owner); - if ( !pg ) - { - rc = -ENOMEM; - break; - } - l1tab = __map_domain_page(pg); + rc = -ENOMEM; + break; } + l1tab = __map_domain_page(pg); clear_page(l1tab); *pl2e = l2e_from_page(pg, __PAGE_HYPERVISOR_RW); } else if ( !l1tab ) l1tab = map_l1t_from_l2e(*pl2e); - if ( ppg && + if ( populate && !(l1e_get_flags(l1tab[l1_table_offset(va)]) & _PAGE_PRESENT) ) { pg = alloc_domheap_page(d, MEMF_no_owner); if ( pg ) { clear_domain_page(page_to_mfn(pg)); - if ( !IS_NIL(ppg) ) - *ppg++ = pg; l1tab[l1_table_offset(va)] = l1e_from_page(pg, __PAGE_HYPERVISOR_RW | _PAGE_AVAIL0); l2e_add_flags(*pl2e, _PAGE_AVAIL0); @@ -XXX,XX +XXX,XX @@ void free_perdomain_mappings(struct domain *d) unmap_domain_page(l1tab); } - if ( is_xen_heap_page(l1pg) ) - free_xenheap_page(page_to_virt(l1pg)); - else - free_domheap_page(l1pg); + free_domheap_page(l1pg); } unmap_domain_page(l2tab); diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/domain.c +++ b/xen/arch/x86/pv/domain.c @@ -XXX,XX +XXX,XX @@ int switch_compat(struct domain *d) static int pv_create_gdt_ldt_l1tab(struct vcpu *v) { return create_perdomain_mapping(v->domain, GDT_VIRT_START(v), - 1U << GDT_LDT_VCPU_SHIFT, - NIL(l1_pgentry_t *), - NULL); + 1U << GDT_LDT_VCPU_SHIFT, false); } static void pv_destroy_gdt_ldt_l1tab(struct vcpu *v) diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/x86_64/mm.c +++ b/xen/arch/x86/x86_64/mm.c @@ -XXX,XX +XXX,XX @@ void __init zap_low_mappings(void) int setup_compat_arg_xlat(struct vcpu *v) { return create_perdomain_mapping(v->domain, ARG_XLAT_START(v), - PFN_UP(COMPAT_ARG_XLAT_SIZE), - NULL, NIL(struct page_info *)); + PFN_UP(COMPAT_ARG_XLAT_SIZE), true); } void free_compat_arg_xlat(struct vcpu *v) -- 2.46.0
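A brief in-tree-style sketch (my_* names invented) of the simplified calling convention: the only remaining choice for a caller is whether the range should be backed by freshly allocated, zeroed domheap pages right away, or whether only the paging structures are put in place so that frames can be installed later with populate_perdomain_mapping().

static int my_setup_perdomain_areas(struct domain *d, unsigned long my_window_va,
                                    unsigned int my_window_pages,
                                    unsigned long my_data_va,
                                    unsigned int my_data_pages)
{
    /* Structures only: frames arrive later via populate_perdomain_mapping(). */
    int rc = create_perdomain_mapping(d, my_window_va, my_window_pages, false);

    /* Fully populated: back the range with zeroed domheap pages now. */
    if ( !rc )
        rc = create_perdomain_mapping(d, my_data_va, my_data_pages, true);

    return rc;
}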
In preparation for the per-domain area being per-vCPU. This requires moving some of the {create,destroy}_perdomain_mapping() calls to the domain initialization and tear down paths into vCPU initialization and tear down. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/domain.c | 12 ++++++++---- xen/arch/x86/domain_page.c | 13 +++++-------- xen/arch/x86/hvm/hvm.c | 5 ----- xen/arch/x86/include/asm/domain.h | 2 +- xen/arch/x86/include/asm/mm.h | 4 ++-- xen/arch/x86/mm.c | 6 ++++-- xen/arch/x86/pv/domain.c | 2 +- xen/arch/x86/x86_64/mm.c | 2 +- 8 files changed, 22 insertions(+), 24 deletions(-) diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -XXX,XX +XXX,XX @@ int arch_vcpu_create(struct vcpu *v) v->arch.flags = TF_kernel_mode; + rc = create_perdomain_mapping(v, PERDOMAIN_VIRT_START, 0, false); + if ( rc ) + return rc; + rc = mapcache_vcpu_init(v); if ( rc ) return rc; @@ -XXX,XX +XXX,XX @@ int arch_vcpu_create(struct vcpu *v) return rc; fail: + free_perdomain_mappings(v); paging_vcpu_teardown(v); vcpu_destroy_fpu(v); xfree(v->arch.msrs); @@ -XXX,XX +XXX,XX @@ void arch_vcpu_destroy(struct vcpu *v) hvm_vcpu_destroy(v); else pv_vcpu_destroy(v); + + free_perdomain_mappings(v); } int arch_sanitise_domain_config(struct xen_domctl_createdomain *config) @@ -XXX,XX +XXX,XX @@ int arch_domain_create(struct domain *d, } else if ( is_pv_domain(d) ) { - if ( (rc = mapcache_domain_init(d)) != 0 ) - goto fail; + mapcache_domain_init(d); if ( (rc = pv_domain_initialise(d)) != 0 ) goto fail; @@ -XXX,XX +XXX,XX @@ int arch_domain_create(struct domain *d, XFREE(d->arch.cpu_policy); if ( paging_initialised ) paging_final_teardown(d); - free_perdomain_mappings(d); return rc; } @@ -XXX,XX +XXX,XX @@ void arch_domain_destroy(struct domain *d) if ( is_pv_domain(d) ) pv_domain_destroy(d); - free_perdomain_mappings(d); free_xenheap_page(d->shared_info); cleanup_domain_irq_mapping(d); diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/domain_page.c +++ b/xen/arch/x86/domain_page.c @@ -XXX,XX +XXX,XX @@ void unmap_domain_page(const void *ptr) local_irq_restore(flags); } -int mapcache_domain_init(struct domain *d) +void mapcache_domain_init(struct domain *d) { struct mapcache_domain *dcache = &d->arch.pv.mapcache; unsigned int bitmap_pages; @@ -XXX,XX +XXX,XX @@ int mapcache_domain_init(struct domain *d) #ifdef NDEBUG if ( !mem_hotplug && max_page <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) ) - return 0; + return; #endif BUILD_BUG_ON(MAPCACHE_VIRT_END + PAGE_SIZE * (3 + @@ -XXX,XX +XXX,XX @@ int mapcache_domain_init(struct domain *d) (bitmap_pages + 1) * PAGE_SIZE / sizeof(long); spin_lock_init(&dcache->lock); - - return create_perdomain_mapping(d, (unsigned long)dcache->inuse, - 2 * bitmap_pages + 1, false); } int mapcache_vcpu_init(struct vcpu *v) @@ -XXX,XX +XXX,XX @@ int mapcache_vcpu_init(struct vcpu *v) if ( ents > dcache->entries ) { /* Populate page tables. */ - int rc = create_perdomain_mapping(d, MAPCACHE_VIRT_START, ents, false); + int rc = create_perdomain_mapping(v, MAPCACHE_VIRT_START, ents, false); /* Populate bit maps. 
*/ if ( !rc ) - rc = create_perdomain_mapping(d, (unsigned long)dcache->inuse, + rc = create_perdomain_mapping(v, (unsigned long)dcache->inuse, nr, true); if ( !rc ) - rc = create_perdomain_mapping(d, (unsigned long)dcache->garbage, + rc = create_perdomain_mapping(v, (unsigned long)dcache->garbage, nr, true); if ( rc ) diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -XXX,XX +XXX,XX @@ int hvm_domain_initialise(struct domain *d, INIT_LIST_HEAD(&d->arch.hvm.mmcfg_regions); INIT_LIST_HEAD(&d->arch.hvm.msix_tables); - rc = create_perdomain_mapping(d, PERDOMAIN_VIRT_START, 0, false); - if ( rc ) - goto fail; - hvm_init_cacheattr_region_list(d); rc = paging_enable(d, PG_refcounts|PG_translate|PG_external); @@ -XXX,XX +XXX,XX @@ int hvm_domain_initialise(struct domain *d, XFREE(d->arch.hvm.irq); fail0: hvm_destroy_cacheattr_region_list(d); - fail: hvm_domain_relinquish_resources(d); XFREE(d->arch.hvm.io_handler); XFREE(d->arch.hvm.pl_time); diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/domain.h +++ b/xen/arch/x86/include/asm/domain.h @@ -XXX,XX +XXX,XX @@ struct mapcache_domain { unsigned long *garbage; }; -int mapcache_domain_init(struct domain *d); +void mapcache_domain_init(struct domain *d); int mapcache_vcpu_init(struct vcpu *v); void mapcache_override_current(struct vcpu *v); diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/mm.h +++ b/xen/arch/x86/include/asm/mm.h @@ -XXX,XX +XXX,XX @@ int compat_arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg); #define NIL(type) ((type *)-sizeof(type)) #define IS_NIL(ptr) (!((uintptr_t)(ptr) + sizeof(*(ptr)))) -int create_perdomain_mapping(struct domain *d, unsigned long va, +int create_perdomain_mapping(struct vcpu *v, unsigned long va, unsigned int nr, bool populate); void populate_perdomain_mapping(const struct vcpu *v, unsigned long va, mfn_t *mfn, unsigned long nr); void destroy_perdomain_mapping(const struct vcpu *v, unsigned long va, unsigned int nr); -void free_perdomain_mappings(struct domain *d); +void free_perdomain_mappings(struct vcpu *v); void __iomem *ioremap_wc(paddr_t pa, size_t len); diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ static bool perdomain_l1e_needs_freeing(l1_pgentry_t l1e) (_PAGE_PRESENT | _PAGE_AVAIL0); } -int create_perdomain_mapping(struct domain *d, unsigned long va, +int create_perdomain_mapping(struct vcpu *v, unsigned long va, unsigned int nr, bool populate) { + struct domain *d = v->domain; struct page_info *pg; l3_pgentry_t *l3tab; l2_pgentry_t *l2tab; @@ -XXX,XX +XXX,XX @@ void destroy_perdomain_mapping(const struct vcpu *v, unsigned long va, unmap_domain_page(l3tab); } -void free_perdomain_mappings(struct domain *d) +void free_perdomain_mappings(struct vcpu *v) { + struct domain *d = v->domain; l3_pgentry_t *l3tab; unsigned int i; diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/domain.c +++ b/xen/arch/x86/pv/domain.c @@ -XXX,XX +XXX,XX @@ int switch_compat(struct domain *d) static int pv_create_gdt_ldt_l1tab(struct vcpu *v) { - return create_perdomain_mapping(v->domain, GDT_VIRT_START(v), + return create_perdomain_mapping(v, 
GDT_VIRT_START(v), 1U << GDT_LDT_VCPU_SHIFT, false); } diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/x86_64/mm.c +++ b/xen/arch/x86/x86_64/mm.c @@ -XXX,XX +XXX,XX @@ void __init zap_low_mappings(void) int setup_compat_arg_xlat(struct vcpu *v) { - return create_perdomain_mapping(v->domain, ARG_XLAT_START(v), + return create_perdomain_mapping(v, ARG_XLAT_START(v), PFN_UP(COMPAT_ARG_XLAT_SIZE), true); } -- 2.46.0
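With the per-vCPU interface in place, creation and teardown now pair up per vCPU rather than per domain. A condensed sketch of the resulting arrangement, mirroring the arch_vcpu_create() hunks above (example_vcpu_init() is illustrative only):

    /* Illustrative: per-vCPU setup/teardown pairing after this change. */
    static int example_vcpu_init(struct vcpu *v)
    {
        /* Ensure the per-domain area exists for this vCPU; nr == 0, nothing mapped. */
        int rc = create_perdomain_mapping(v, PERDOMAIN_VIRT_START, 0, false);

        if ( rc )
            return rc;

        rc = mapcache_vcpu_init(v);
        if ( rc )
            /* The error path now tears down per-vCPU, not per-domain, mappings. */
            free_perdomain_mappings(v);

        return rc;
    }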
The current logic gates issuing TLB flush requests with the FLUSH_ROOT_PGTBL flag on XPTI being enabled. In preparation for FLUSH_ROOT_PGTBL also being needed when not using XPTI, untie it from the xpti domain boolean and instead introduce a new flush_root_pt field. No functional change intended, as flush_root_pt == xpti. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/include/asm/domain.h | 2 ++ xen/arch/x86/include/asm/flushtlb.h | 2 +- xen/arch/x86/mm.c | 2 +- xen/arch/x86/pv/domain.c | 2 ++ 4 files changed, 6 insertions(+), 2 deletions(-) diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/domain.h +++ b/xen/arch/x86/include/asm/domain.h @@ -XXX,XX +XXX,XX @@ struct pv_domain bool pcid; /* Mitigate L1TF with shadow/crashing? */ bool check_l1tf; + /* Issue FLUSH_ROOT_PGTBL for root page-table changes. */ + bool flush_root_pt; /* map_domain_page() mapping cache. */ struct mapcache_domain mapcache; diff --git a/xen/arch/x86/include/asm/flushtlb.h b/xen/arch/x86/include/asm/flushtlb.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/flushtlb.h +++ b/xen/arch/x86/include/asm/flushtlb.h @@ -XXX,XX +XXX,XX @@ void flush_area_mask(const cpumask_t *mask, const void *va, #define flush_root_pgtbl_domain(d) \ { \ - if ( is_pv_domain(d) && (d)->arch.pv.xpti ) \ + if ( is_pv_domain(d) && (d)->arch.pv.flush_root_pt ) \ flush_mask((d)->dirty_cpumask, FLUSH_ROOT_PGTBL); \ } diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ long do_mmu_update( cmd == MMU_PT_UPDATE_PRESERVE_AD, v); if ( !rc ) flush_linear_pt = true; - if ( !rc && pt_owner->arch.pv.xpti ) + if ( !rc && pt_owner->arch.pv.flush_root_pt ) { bool local_in_use = false; diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/domain.c +++ b/xen/arch/x86/pv/domain.c @@ -XXX,XX +XXX,XX @@ int pv_domain_initialise(struct domain *d) d->arch.ctxt_switch = &pv_csw; + d->arch.pv.flush_root_pt = d->arch.pv.xpti; + if ( !is_pv_32bit_domain(d) && use_invpcid && cpu_has_pcid ) switch ( ACCESS_ONCE(opt_pcid) ) { -- 2.46.0
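In short, the new field is set once at PV domain creation and consulted wherever a root page-table change may need propagating. A condensed sketch of the producer and consumer sides from the hunks above (the example_* wrappers are illustrative only):

    /* Illustrative: producer and consumer of the new flush_root_pt field. */
    static void example_domain_init(struct domain *d)
    {
        /* For now the field simply mirrors XPTI, hence no functional change. */
        d->arch.pv.flush_root_pt = d->arch.pv.xpti;
    }

    static void example_root_pt_changed(struct domain *d)
    {
        if ( is_pv_domain(d) && d->arch.pv.flush_root_pt )
            flush_mask(d->dirty_cpumask, FLUSH_ROOT_PGTBL);
    }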
Move the handling of FLUSH_ROOT_PGTBL in flush_area_local() ahead of the logic that does the TLB flushing, in preparation for further changes requiring the TLB flush to be strictly done after having handled FLUSH_ROOT_PGTBL. No functional change intended. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/flushtlb.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/flushtlb.c +++ b/xen/arch/x86/flushtlb.c @@ -XXX,XX +XXX,XX @@ unsigned int flush_area_local(const void *va, unsigned int flags) { unsigned int order = (flags - 1) & FLUSH_ORDER_MASK; + if ( flags & FLUSH_ROOT_PGTBL ) + get_cpu_info()->root_pgt_changed = true; + if ( flags & (FLUSH_TLB|FLUSH_TLB_GLOBAL) ) { if ( order == 0 ) @@ -XXX,XX +XXX,XX @@ unsigned int flush_area_local(const void *va, unsigned int flags) } } - if ( flags & FLUSH_ROOT_PGTBL ) - get_cpu_info()->root_pgt_changed = true; - return flags; } -- 2.46.0
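The resulting ordering in flush_area_local() can be summarised as follows (a sketch only; do_tlb_flush() is a stand-in for the existing TLB flushing logic, not a real function):

    /* Sketch of the resulting order; do_tlb_flush() is a placeholder. */
    static unsigned int example_flush_area_local(const void *va, unsigned int flags)
    {
        if ( flags & FLUSH_ROOT_PGTBL )
            get_cpu_info()->root_pgt_changed = true;  /* handled first ... */

        if ( flags & (FLUSH_TLB | FLUSH_TLB_GLOBAL) )
            do_tlb_flush(va, flags);                  /* ... TLB flush strictly after */

        return flags;
    }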
No functional change, as the option is not yet used. It is introduced now so that newly added functionality can be keyed on the option being enabled, even while the feature remains non-functional. When ASI is enabled for PV domains, the XPTI status line might be omitted from the log if XPTI ends up uniformly disabled as a consequence of using ASI. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- Changes since v1: - Improve comments and documentation about what ASI provides. - Do not print the XPTI information if ASI is used for pv domUs and dom0 is PVH, or if ASI is used for both domU and dom0. FWIW, I would print the state of XPTI uniformly, as otherwise I find the output might be confusing for users expecting to assert the state of XPTI. --- docs/misc/xen-command-line.pandoc | 19 +++++ xen/arch/x86/include/asm/domain.h | 3 + xen/arch/x86/include/asm/spec_ctrl.h | 2 + xen/arch/x86/spec_ctrl.c | 115 +++++++++++++++++++++++++-- 4 files changed, 133 insertions(+), 6 deletions(-) diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc index XXXXXXX..XXXXXXX 100644 --- a/docs/misc/xen-command-line.pandoc +++ b/docs/misc/xen-command-line.pandoc @@ -XXX,XX +XXX,XX @@ to appropriate auditing by Xen. Argo is disabled by default. This option is disabled by default, to protect domains from a DoS by a buggy or malicious other domain spamming the ring. +### asi (x86) +> `= List of [ <bool>, {pv,hvm}=<bool>, + {vcpu-pt}=<bool>|{pv,hvm}=<bool> ]` + +Offers control over whether the hypervisor will engage in Address Space +Isolation, by not having potentially sensitive information permanently mapped +in the VMM page-tables. Using this option might avoid the need to apply +mitigations for certain speculative related attacks, at the cost of mapping +sensitive information on-demand. + +* `pv=` and `hvm=` sub-options allow enabling for specific guest types. + +**WARNING: manual de-selection of enabled options will invalidate any +protection offered by the feature. The fine grained options provided below are +meant to be used for debugging purposes only.** + +* `vcpu-pt` ensure each vCPU uses a unique top-level page-table and setup a + virtual address space region to map memory on a per-vCPU basis. + ### asid (x86) > `= <boolean>` diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/domain.h +++ b/xen/arch/x86/include/asm/domain.h @@ -XXX,XX +XXX,XX @@ struct arch_domain /* Don't unconditionally inject #GP for unhandled MSRs. */ bool msr_relaxed; + /* Use a per-vCPU root pt, and switch per-domain slot to per-vCPU. */ + bool vcpu_pt; + /* Emulated devices enabled bitmap. 
*/ uint32_t emulation_flags; } __cacheline_aligned; diff --git a/xen/arch/x86/include/asm/spec_ctrl.h b/xen/arch/x86/include/asm/spec_ctrl.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/spec_ctrl.h +++ b/xen/arch/x86/include/asm/spec_ctrl.h @@ -XXX,XX +XXX,XX @@ extern uint8_t default_scf; extern int8_t opt_xpti_hwdom, opt_xpti_domu; +extern int8_t opt_vcpu_pt_pv, opt_vcpu_pt_hwdom, opt_vcpu_pt_hvm; + extern bool cpu_has_bug_l1tf; extern int8_t opt_pv_l1tf_hwdom, opt_pv_l1tf_domu; extern bool opt_bp_spec_reduce; diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/spec_ctrl.c +++ b/xen/arch/x86/spec_ctrl.c @@ -XXX,XX +XXX,XX @@ static int8_t __initdata opt_gds_mit = -1; static int8_t __initdata opt_div_scrub = -1; bool __ro_after_init opt_bp_spec_reduce = true; +/* Use a per-vCPU root page-table and switch the per-domain slot to per-vCPU. */ +int8_t __ro_after_init opt_vcpu_pt_hvm = -1; +int8_t __ro_after_init opt_vcpu_pt_hwdom = -1; +int8_t __ro_after_init opt_vcpu_pt_pv = -1; + static int __init cf_check parse_spec_ctrl(const char *s) { const char *ss; @@ -XXX,XX +XXX,XX @@ int8_t __ro_after_init opt_xpti_domu = -1; static __init void xpti_init_default(void) { + ASSERT(opt_vcpu_pt_pv >= 0 && opt_vcpu_pt_hwdom >= 0); + if ( (opt_xpti_hwdom == 1 || opt_xpti_domu == 1) && opt_vcpu_pt_pv == 1 ) + { + printk(XENLOG_ERR + "XPTI incompatible with per-vCPU page-tables, disabling ASI\n"); + opt_vcpu_pt_pv = 0; + } if ( (boot_cpu_data.x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)) || cpu_has_rdcl_no ) { @@ -XXX,XX +XXX,XX @@ static __init void xpti_init_default(void) else { if ( opt_xpti_hwdom < 0 ) - opt_xpti_hwdom = 1; + opt_xpti_hwdom = !opt_vcpu_pt_hwdom; if ( opt_xpti_domu < 0 ) - opt_xpti_domu = 1; + opt_xpti_domu = !opt_vcpu_pt_pv; } } @@ -XXX,XX +XXX,XX @@ static int __init cf_check parse_pv_l1tf(const char *s) } custom_param("pv-l1tf", parse_pv_l1tf); +static int __init cf_check parse_asi(const char *s) +{ + const char *ss; + int val, rc = 0; + + /* Interpret 'asi' alone in its positive boolean form. */ + if ( *s == '\0' ) + opt_vcpu_pt_pv = opt_vcpu_pt_hwdom = opt_vcpu_pt_hvm = 1; + + do { + ss = strchr(s, ','); + if ( !ss ) + ss = strchr(s, '\0'); + + val = parse_bool(s, ss); + switch ( val ) + { + case 0: + case 1: + opt_vcpu_pt_pv = opt_vcpu_pt_hwdom = opt_vcpu_pt_hvm = val; + break; + + default: + if ( (val = parse_boolean("pv", s, ss)) >= 0 ) + opt_vcpu_pt_pv = val; + else if ( (val = parse_boolean("hvm", s, ss)) >= 0 ) + opt_vcpu_pt_hvm = val; + else if ( (val = parse_boolean("vcpu-pt", s, ss)) != -1 ) + { + switch ( val ) + { + case 1: + case 0: + opt_vcpu_pt_pv = opt_vcpu_pt_hvm = opt_vcpu_pt_hwdom = val; + break; + + case -2: + s += strlen("vcpu-pt="); + if ( (val = parse_boolean("pv", s, ss)) >= 0 ) + opt_vcpu_pt_pv = val; + else if ( (val = parse_boolean("hvm", s, ss)) >= 0 ) + opt_vcpu_pt_hvm = val; + else + default: + rc = -EINVAL; + break; + } + } + else if ( *s ) + rc = -EINVAL; + break; + } + + s = ss + 1; + } while ( *ss ); + + return rc; +} +custom_param("asi", parse_asi); + static void __init print_details(enum ind_thunk thunk) { unsigned int _7d0 = 0, _7d2 = 0, e8b = 0, e21a = 0, max = 0, tmp; @@ -XXX,XX +XXX,XX @@ static void __init print_details(enum ind_thunk thunk) boot_cpu_has(X86_FEATURE_IBPB_ENTRY_PV) ? " IBPB-entry" : "", opt_bhb_entry_pv ? " BHB-entry" : ""); - printk(" XPTI (64-bit PV only): Dom0 %s, DomU %s (with%s PCID)\n", - opt_xpti_hwdom ? 
"enabled" : "disabled", - opt_xpti_domu ? "enabled" : "disabled", - xpti_pcid_enabled() ? "" : "out"); + if ( !opt_vcpu_pt_pv || (!opt_dom0_pvh && !opt_vcpu_pt_hwdom) ) + printk(" XPTI (64-bit PV only): Dom0 %s, DomU %s (with%s PCID)\n", + opt_xpti_hwdom ? "enabled" : "disabled", + opt_xpti_domu ? "enabled" : "disabled", + xpti_pcid_enabled() ? "" : "out"); printk(" PV L1TF shadowing: Dom0 %s, DomU %s\n", opt_pv_l1tf_hwdom ? "enabled" : "disabled", opt_pv_l1tf_domu ? "enabled" : "disabled"); #endif + +#ifdef CONFIG_HVM + printk(" ASI features for HVM VMs:%s%s\n", + opt_vcpu_pt_hvm ? "" : " None", + opt_vcpu_pt_hvm ? " vCPU-PT" : ""); + +#endif +#ifdef CONFIG_PV + printk(" ASI features for PV VMs:%s%s\n", + opt_vcpu_pt_pv ? "" : " None", + opt_vcpu_pt_pv ? " vCPU-PT" : ""); + +#endif } static bool __init check_smt_enabled(void) @@ -XXX,XX +XXX,XX @@ void spec_ctrl_init_domain(struct domain *d) if ( pv ) d->arch.pv.xpti = is_hardware_domain(d) ? opt_xpti_hwdom : opt_xpti_domu; + + d->arch.vcpu_pt = is_hardware_domain(d) ? opt_vcpu_pt_hwdom + : pv ? opt_vcpu_pt_pv + : opt_vcpu_pt_hvm; } void __init init_speculation_mitigations(void) @@ -XXX,XX +XXX,XX @@ void __init init_speculation_mitigations(void) hw_smt_enabled && default_xen_spec_ctrl ) setup_force_cpu_cap(X86_FEATURE_SC_MSR_IDLE); + /* Disable all ASI options by default until feature is finished. */ + if ( opt_vcpu_pt_pv == -1 ) + opt_vcpu_pt_pv = 0; + if ( opt_vcpu_pt_hwdom == -1 ) + opt_vcpu_pt_hwdom = 0; + if ( opt_vcpu_pt_hvm == -1 ) + opt_vcpu_pt_hvm = 0; + + if ( opt_vcpu_pt_pv || opt_vcpu_pt_hvm ) + warning_add( + "Address Space Isolation is not functional, this option is\n" + "intended to be used only for development purposes.\n"); + xpti_init_default(); l1tf_calculations(); -- 2.46.0
Such table is to be used in the per-domain slot when running with Address Space Isolation enabled for the domain. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/include/asm/domain.h | 3 +++ xen/arch/x86/include/asm/mm.h | 2 +- xen/arch/x86/mm.c | 45 ++++++++++++++++++++++--------- xen/arch/x86/mm/hap/hap.c | 2 +- xen/arch/x86/mm/shadow/hvm.c | 2 +- xen/arch/x86/mm/shadow/multi.c | 2 +- xen/arch/x86/pv/dom0_build.c | 2 +- xen/arch/x86/pv/domain.c | 2 +- 8 files changed, 41 insertions(+), 19 deletions(-) diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/domain.h +++ b/xen/arch/x86/include/asm/domain.h @@ -XXX,XX +XXX,XX @@ struct arch_vcpu struct vcpu_msrs *msrs; + /* ASI: per-vCPU L3 table to use in the L4 per-domain slot. */ + struct page_info *pervcpu_l3_pg; + struct { bool next_interrupt_enabled; } monitor; diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/mm.h +++ b/xen/arch/x86/include/asm/mm.h @@ -XXX,XX +XXX,XX @@ int devalidate_page(struct page_info *page, unsigned long type, void init_xen_pae_l2_slots(l2_pgentry_t *l2t, const struct domain *d); void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn, - const struct domain *d, mfn_t sl4mfn, bool ro_mpt); + const struct vcpu *v, mfn_t sl4mfn, bool ro_mpt); bool fill_ro_mpt(mfn_t mfn); void zap_ro_mpt(mfn_t mfn); diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ static int promote_l3_table(struct page_info *page) * extended directmap. */ void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn, - const struct domain *d, mfn_t sl4mfn, bool ro_mpt) + const struct vcpu *v, mfn_t sl4mfn, bool ro_mpt) { + const struct domain *d = v->domain; /* * PV vcpus need a shortened directmap. HVM and Idle vcpus get the full * directmap. @@ -XXX,XX +XXX,XX @@ void init_xen_l4_slots(l4_pgentry_t *l4t, mfn_t l4mfn, /* Slot 260: Per-domain mappings. */ l4t[l4_table_offset(PERDOMAIN_VIRT_START)] = - l4e_from_page(d->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW); + l4e_from_page(d->arch.vcpu_pt ? v->arch.pervcpu_l3_pg + : d->arch.perdomain_l3_pg, + __PAGE_HYPERVISOR_RW); /* Slot 4: Per-domain mappings mirror. */ BUILD_BUG_ON(IS_ENABLED(CONFIG_PV32) && @@ -XXX,XX +XXX,XX @@ static int promote_l4_table(struct page_info *page) if ( !rc ) { + /* + * Use vCPU#0 unconditionally. When not running with ASI enabled the + * per-domain table is shared between all vCPUs, so it doesn't matter + * which vCPU gets passed to init_xen_l4_slots(). When running with + * ASI enabled this L4 will not be used, as a shadow per-vCPU L4 is + * used instead. 
+ */ init_xen_l4_slots(pl4e, l4mfn, - d, INVALID_MFN, VM_ASSIST(d, m2p_strict)); + d->vcpu[0], INVALID_MFN, VM_ASSIST(d, m2p_strict)); atomic_inc(&d->arch.pv.nr_l4_pages); } unmap_domain_page(pl4e); @@ -XXX,XX +XXX,XX @@ int create_perdomain_mapping(struct vcpu *v, unsigned long va, ASSERT(va >= PERDOMAIN_VIRT_START && va < PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS)); - if ( !d->arch.perdomain_l3_pg ) + if ( !v->arch.pervcpu_l3_pg && !d->arch.perdomain_l3_pg ) { pg = alloc_domheap_page(d, MEMF_no_owner); if ( !pg ) return -ENOMEM; l3tab = __map_domain_page(pg); clear_page(l3tab); - d->arch.perdomain_l3_pg = pg; + if ( d->arch.vcpu_pt ) + v->arch.pervcpu_l3_pg = pg; + else + d->arch.perdomain_l3_pg = pg; if ( !nr ) { unmap_domain_page(l3tab); @@ -XXX,XX +XXX,XX @@ int create_perdomain_mapping(struct vcpu *v, unsigned long va, else if ( !nr ) return 0; else - l3tab = __map_domain_page(d->arch.perdomain_l3_pg); + l3tab = __map_domain_page(d->arch.vcpu_pt ? v->arch.pervcpu_l3_pg + : d->arch.perdomain_l3_pg); ASSERT(!l3_table_offset(va ^ (va + nr * PAGE_SIZE - 1))); @@ -XXX,XX +XXX,XX @@ void populate_perdomain_mapping(const struct vcpu *v, unsigned long va, return; } - ASSERT(d->arch.perdomain_l3_pg); - l3tab = __map_domain_page(d->arch.perdomain_l3_pg); + ASSERT(d->arch.perdomain_l3_pg || v->arch.pervcpu_l3_pg); + l3tab = __map_domain_page(d->arch.vcpu_pt ? v->arch.pervcpu_l3_pg + : d->arch.perdomain_l3_pg); if ( unlikely(!(l3e_get_flags(l3tab[l3_table_offset(va)]) & _PAGE_PRESENT)) ) @@ -XXX,XX +XXX,XX @@ void destroy_perdomain_mapping(const struct vcpu *v, unsigned long va, va < PERDOMAIN_VIRT_SLOT(PERDOMAIN_SLOTS)); ASSERT(!nr || !l3_table_offset(va ^ (va + nr * PAGE_SIZE - 1))); - if ( !d->arch.perdomain_l3_pg ) + if ( !d->arch.perdomain_l3_pg && !v->arch.pervcpu_l3_pg ) return; /* Use likely to force the optimization for the fast path. */ @@ -XXX,XX +XXX,XX @@ void destroy_perdomain_mapping(const struct vcpu *v, unsigned long va, return; } - l3tab = __map_domain_page(d->arch.perdomain_l3_pg); + l3tab = __map_domain_page(d->arch.vcpu_pt ? v->arch.pervcpu_l3_pg + : d->arch.perdomain_l3_pg); pl3e = l3tab + l3_table_offset(va); if ( l3e_get_flags(*pl3e) & _PAGE_PRESENT ) @@ -XXX,XX +XXX,XX @@ void free_perdomain_mappings(struct vcpu *v) l3_pgentry_t *l3tab; unsigned int i; - if ( !d->arch.perdomain_l3_pg ) + if ( !v->arch.pervcpu_l3_pg && !d->arch.perdomain_l3_pg ) return; - l3tab = __map_domain_page(d->arch.perdomain_l3_pg); + l3tab = __map_domain_page(d->arch.vcpu_pt ? v->arch.pervcpu_l3_pg + : d->arch.perdomain_l3_pg); for ( i = 0; i < PERDOMAIN_SLOTS; ++i) if ( l3e_get_flags(l3tab[i]) & _PAGE_PRESENT ) @@ -XXX,XX +XXX,XX @@ void free_perdomain_mappings(struct vcpu *v) } unmap_domain_page(l3tab); - free_domheap_page(d->arch.perdomain_l3_pg); + free_domheap_page(d->arch.vcpu_pt ? 
v->arch.pervcpu_l3_pg + : d->arch.perdomain_l3_pg); d->arch.perdomain_l3_pg = NULL; + v->arch.pervcpu_l3_pg = NULL; } static void write_sss_token(unsigned long *ptr) diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm/hap/hap.c +++ b/xen/arch/x86/mm/hap/hap.c @@ -XXX,XX +XXX,XX @@ static mfn_t hap_make_monitor_table(struct vcpu *v) m4mfn = page_to_mfn(pg); l4e = map_domain_page(m4mfn); - init_xen_l4_slots(l4e, m4mfn, d, INVALID_MFN, false); + init_xen_l4_slots(l4e, m4mfn, v, INVALID_MFN, false); unmap_domain_page(l4e); return m4mfn; diff --git a/xen/arch/x86/mm/shadow/hvm.c b/xen/arch/x86/mm/shadow/hvm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm/shadow/hvm.c +++ b/xen/arch/x86/mm/shadow/hvm.c @@ -XXX,XX +XXX,XX @@ mfn_t sh_make_monitor_table(const struct vcpu *v, unsigned int shadow_levels) * shadow-linear mapping will either be inserted below when creating * lower level monitor tables, or later in sh_update_cr3(). */ - init_xen_l4_slots(l4e, m4mfn, d, INVALID_MFN, false); + init_xen_l4_slots(l4e, m4mfn, v, INVALID_MFN, false); if ( shadow_levels < 4 ) { diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm/shadow/multi.c +++ b/xen/arch/x86/mm/shadow/multi.c @@ -XXX,XX +XXX,XX @@ sh_make_shadow(struct vcpu *v, mfn_t gmfn, u32 shadow_type) BUILD_BUG_ON(sizeof(l4_pgentry_t) != sizeof(shadow_l4e_t)); - init_xen_l4_slots(l4t, gmfn, d, smfn, (!is_pv_32bit_domain(d) && + init_xen_l4_slots(l4t, gmfn, v, smfn, (!is_pv_32bit_domain(d) && VM_ASSIST(d, m2p_strict))); unmap_domain_page(l4t); } diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/dom0_build.c +++ b/xen/arch/x86/pv/dom0_build.c @@ -XXX,XX +XXX,XX @@ static int __init dom0_construct(struct boot_info *bi, struct domain *d) l4start = l4tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE; clear_page(l4tab); init_xen_l4_slots(l4tab, _mfn(virt_to_mfn(l4start)), - d, INVALID_MFN, true); + d->vcpu[0], INVALID_MFN, true); v->arch.guest_table = pagetable_from_paddr(__pa(l4start)); } else diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/domain.c +++ b/xen/arch/x86/pv/domain.c @@ -XXX,XX +XXX,XX @@ static int setup_compat_l4(struct vcpu *v) mfn = page_to_mfn(pg); l4tab = map_domain_page(mfn); clear_page(l4tab); - init_xen_l4_slots(l4tab, mfn, v->domain, INVALID_MFN, false); + init_xen_l4_slots(l4tab, mfn, v, INVALID_MFN, false); unmap_domain_page(l4tab); /* This page needs to look like a pagetable so that it can be shadowed */ -- 2.46.0
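The `d->arch.vcpu_pt ? v->arch.pervcpu_l3_pg : d->arch.perdomain_l3_pg` selection repeats in several hunks above; a possible helper to centralise it could look like the following (hypothetical, not part of the patch):

    /* Hypothetical helper: L3 page backing the per-domain slot for @v. */
    static struct page_info *perdomain_l3_page(const struct vcpu *v)
    {
        const struct domain *d = v->domain;

        return d->arch.vcpu_pt ? v->arch.pervcpu_l3_pg
                               : d->arch.perdomain_l3_pg;
    }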
When using a unique per-vCPU root page table the per-domain region becomes per-vCPU, and hence the mapcache is no longer shared between all vCPUs of a domain. Introduce per-vCPU mapcache structures, and modify map_domain_page() to create per-vCPU mappings when possible. Note the lock is also not needed with using per-vCPU map caches, as the structure is no longer shared. This introduces some duplication in the domain and vcpu structures, as both contain a mapcache field to support running with and without per-vCPU page-tables. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/domain_page.c | 90 ++++++++++++++++++++----------- xen/arch/x86/include/asm/domain.h | 20 ++++--- 2 files changed, 71 insertions(+), 39 deletions(-) diff --git a/xen/arch/x86/domain_page.c b/xen/arch/x86/domain_page.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/domain_page.c +++ b/xen/arch/x86/domain_page.c @@ -XXX,XX +XXX,XX @@ void *map_domain_page(mfn_t mfn) struct vcpu *v; struct mapcache_domain *dcache; struct mapcache_vcpu *vcache; + struct mapcache *cache; struct vcpu_maphash_entry *hashent; + struct domain *d; #ifdef NDEBUG if ( mfn_x(mfn) <= PFN_DOWN(__pa(HYPERVISOR_VIRT_END - 1)) ) @@ -XXX,XX +XXX,XX @@ void *map_domain_page(mfn_t mfn) if ( !v || !is_pv_vcpu(v) ) return mfn_to_virt(mfn_x(mfn)); - dcache = &v->domain->arch.pv.mapcache; + d = v->domain; + dcache = &d->arch.pv.mapcache; vcache = &v->arch.pv.mapcache; - if ( !dcache->inuse ) + cache = d->arch.vcpu_pt ? &v->arch.pv.mapcache.cache + : &d->arch.pv.mapcache.cache; + if ( !cache->inuse ) return mfn_to_virt(mfn_x(mfn)); perfc_incr(map_domain_page_count); @@ -XXX,XX +XXX,XX @@ void *map_domain_page(mfn_t mfn) if ( hashent->mfn == mfn_x(mfn) ) { idx = hashent->idx; - ASSERT(idx < dcache->entries); + ASSERT(idx < cache->entries); hashent->refcnt++; ASSERT(hashent->refcnt); ASSERT(mfn_eq(l1e_get_mfn(MAPCACHE_L1ENT(idx)), mfn)); goto out; } - spin_lock(&dcache->lock); + if ( !d->arch.vcpu_pt ) + spin_lock(&dcache->lock); /* Has some other CPU caused a wrap? We must flush if so. */ - if ( unlikely(dcache->epoch != vcache->shadow_epoch) ) + if ( unlikely(!d->arch.vcpu_pt && dcache->epoch != vcache->shadow_epoch) ) { vcache->shadow_epoch = dcache->epoch; if ( NEED_FLUSH(this_cpu(tlbflush_time), dcache->tlbflush_timestamp) ) @@ -XXX,XX +XXX,XX @@ void *map_domain_page(mfn_t mfn) } } - idx = find_next_zero_bit(dcache->inuse, dcache->entries, dcache->cursor); - if ( unlikely(idx >= dcache->entries) ) + idx = find_next_zero_bit(cache->inuse, cache->entries, cache->cursor); + if ( unlikely(idx >= cache->entries) ) { unsigned long accum = 0, prev = 0; /* /First/, clean the garbage map and update the inuse list. */ - for ( i = 0; i < BITS_TO_LONGS(dcache->entries); i++ ) + for ( i = 0; i < BITS_TO_LONGS(cache->entries); i++ ) { accum |= prev; - dcache->inuse[i] &= ~xchg(&dcache->garbage[i], 0); - prev = ~dcache->inuse[i]; + cache->inuse[i] &= ~xchg(&cache->garbage[i], 0); + prev = ~cache->inuse[i]; } - if ( accum | (prev & BITMAP_LAST_WORD_MASK(dcache->entries)) ) - idx = find_first_zero_bit(dcache->inuse, dcache->entries); + if ( accum | (prev & BITMAP_LAST_WORD_MASK(cache->entries)) ) + idx = find_first_zero_bit(cache->inuse, cache->entries); else { /* Replace a hash entry instead. */ @@ -XXX,XX +XXX,XX @@ void *map_domain_page(mfn_t mfn) i = 0; } while ( i != MAPHASH_HASHFN(mfn_x(mfn)) ); } - BUG_ON(idx >= dcache->entries); + BUG_ON(idx >= cache->entries); /* /Second/, flush TLBs. 
*/ perfc_incr(domain_page_tlb_flush); flush_tlb_local(); - vcache->shadow_epoch = ++dcache->epoch; - dcache->tlbflush_timestamp = tlbflush_current_time(); + if ( !d->arch.vcpu_pt ) + { + vcache->shadow_epoch = ++dcache->epoch; + dcache->tlbflush_timestamp = tlbflush_current_time(); + } } - set_bit(idx, dcache->inuse); - dcache->cursor = idx + 1; + set_bit(idx, cache->inuse); + cache->cursor = idx + 1; - spin_unlock(&dcache->lock); + if ( !d->arch.vcpu_pt ) + spin_unlock(&dcache->lock); l1e_write(&MAPCACHE_L1ENT(idx), l1e_from_mfn(mfn, __PAGE_HYPERVISOR_RW)); @@ -XXX,XX +XXX,XX @@ void unmap_domain_page(const void *ptr) unsigned int idx; struct vcpu *v; struct mapcache_domain *dcache; + struct mapcache *cache; unsigned long va = (unsigned long)ptr, mfn, flags; struct vcpu_maphash_entry *hashent; @@ -XXX,XX +XXX,XX @@ void unmap_domain_page(const void *ptr) ASSERT(v && is_pv_vcpu(v)); dcache = &v->domain->arch.pv.mapcache; - ASSERT(dcache->inuse); + cache = v->domain->arch.vcpu_pt ? &v->arch.pv.mapcache.cache + : &v->domain->arch.pv.mapcache.cache; + ASSERT(cache->inuse); idx = PFN_DOWN(va - MAPCACHE_VIRT_START); mfn = l1e_get_pfn(MAPCACHE_L1ENT(idx)); @@ -XXX,XX +XXX,XX @@ void unmap_domain_page(const void *ptr) hashent->mfn); l1e_write(&MAPCACHE_L1ENT(hashent->idx), l1e_empty()); /* /Second/, mark as garbage. */ - set_bit(hashent->idx, dcache->garbage); + set_bit(hashent->idx, cache->garbage); } /* Add newly-freed mapping to the maphash. */ @@ -XXX,XX +XXX,XX @@ void unmap_domain_page(const void *ptr) /* /First/, zap the PTE. */ l1e_write(&MAPCACHE_L1ENT(idx), l1e_empty()); /* /Second/, mark as garbage. */ - set_bit(idx, dcache->garbage); + set_bit(idx, cache->garbage); } local_irq_restore(flags); @@ -XXX,XX +XXX,XX @@ void unmap_domain_page(const void *ptr) void mapcache_domain_init(struct domain *d) { struct mapcache_domain *dcache = &d->arch.pv.mapcache; - unsigned int bitmap_pages; ASSERT(is_pv_domain(d)); @@ -XXX,XX +XXX,XX @@ void mapcache_domain_init(struct domain *d) return; #endif + if ( d->arch.vcpu_pt ) + return; + BUILD_BUG_ON(MAPCACHE_VIRT_END + PAGE_SIZE * (3 + 2 * PFN_UP(BITS_TO_LONGS(MAPCACHE_ENTRIES) * sizeof(long))) > MAPCACHE_VIRT_START + (PERDOMAIN_SLOT_MBYTES << 20)); - bitmap_pages = PFN_UP(BITS_TO_LONGS(MAPCACHE_ENTRIES) * sizeof(long)); - dcache->inuse = (void *)MAPCACHE_VIRT_END + PAGE_SIZE; - dcache->garbage = dcache->inuse + - (bitmap_pages + 1) * PAGE_SIZE / sizeof(long); spin_lock_init(&dcache->lock); } @@ -XXX,XX +XXX,XX @@ int mapcache_vcpu_init(struct vcpu *v) { struct domain *d = v->domain; struct mapcache_domain *dcache = &d->arch.pv.mapcache; + struct mapcache *cache; unsigned long i; - unsigned int ents = d->max_vcpus * MAPCACHE_VCPU_ENTRIES; + unsigned int ents = (d->arch.vcpu_pt ? 1 : d->max_vcpus) * + MAPCACHE_VCPU_ENTRIES; unsigned int nr = PFN_UP(BITS_TO_LONGS(ents) * sizeof(long)); - if ( !is_pv_vcpu(v) || !dcache->inuse ) + if ( !is_pv_vcpu(v) ) return 0; - if ( ents > dcache->entries ) + cache = d->arch.vcpu_pt ? &v->arch.pv.mapcache.cache + : &dcache->cache; + + if ( !cache->inuse ) + return 0; + + if ( ents > cache->entries ) { /* Populate page tables. */ int rc = create_perdomain_mapping(v, MAPCACHE_VIRT_START, ents, false); + const unsigned int bitmap_pages = + PFN_UP(BITS_TO_LONGS(MAPCACHE_ENTRIES) * sizeof(long)); + + cache->inuse = (void *)MAPCACHE_VIRT_END + PAGE_SIZE; + cache->garbage = cache->inuse + + (bitmap_pages + 1) * PAGE_SIZE / sizeof(long); + /* Populate bit maps. 
*/ if ( !rc ) - rc = create_perdomain_mapping(v, (unsigned long)dcache->inuse, + rc = create_perdomain_mapping(v, (unsigned long)cache->inuse, nr, true); if ( !rc ) - rc = create_perdomain_mapping(v, (unsigned long)dcache->garbage, + rc = create_perdomain_mapping(v, (unsigned long)cache->garbage, nr, true); if ( rc ) return rc; - dcache->entries = ents; + cache->entries = ents; } /* Mark all maphash entries as not in use. */ diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/domain.h +++ b/xen/arch/x86/include/asm/domain.h @@ -XXX,XX +XXX,XX @@ struct trap_bounce { unsigned long eip; }; +struct mapcache { + /* The number of array entries, and a cursor into the array. */ + unsigned int entries; + unsigned int cursor; + + /* Which mappings are in use, and which are garbage to reap next epoch? */ + unsigned long *inuse; + unsigned long *garbage; +}; + #define MAPHASH_ENTRIES 8 #define MAPHASH_HASHFN(pfn) ((pfn) & (MAPHASH_ENTRIES-1)) #define MAPHASHENT_NOTINUSE ((u32)~0U) @@ -XXX,XX +XXX,XX @@ struct mapcache_vcpu { uint32_t idx; uint32_t refcnt; } hash[MAPHASH_ENTRIES]; + + struct mapcache cache; }; struct mapcache_domain { - /* The number of array entries, and a cursor into the array. */ - unsigned int entries; - unsigned int cursor; - /* Protects map_domain_page(). */ spinlock_t lock; @@ -XXX,XX +XXX,XX @@ struct mapcache_domain { unsigned int epoch; u32 tlbflush_timestamp; - /* Which mappings are in use, and which are garbage to reap next epoch? */ - unsigned long *inuse; - unsigned long *garbage; + struct mapcache cache; }; void mapcache_domain_init(struct domain *d); -- 2.46.0
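One consequence worth spelling out: with per-vCPU page-tables the mapcache no longer has to be dimensioned for every vCPU of the domain, since each vCPU only ever sees its own mappings and no locking or epoch tracking applies. A small sketch of the sizing rule used by mapcache_vcpu_init() above (example_mapcache_entries() is illustrative only):

    /* Illustrative: mapcache entries populated per address space. */
    static unsigned int example_mapcache_entries(const struct domain *d)
    {
        /*
         * Shared per-domain area: size for all vCPUs of the domain.
         * Per-vCPU page-tables: each vCPU only needs its own slice.
         */
        return (d->arch.vcpu_pt ? 1 : d->max_vcpus) * MAPCACHE_VCPU_ENTRIES;
    }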
When running PV guests it's possible for the guest to use the same root page table (L4) for all vCPUs, which in turn will result in Xen also using the same root page table on all pCPUs that are running any domain vCPU. When using XPTI Xen switches to a per-CPU shadow L4 when running in guest context, switching to the fully populated L4 when in Xen context. Take advantage of this existing shadowing and force the usage of a per-CPU L4 that shadows the guest selected L4 when Address Space Isolation is requested for PV guests. The mapping of the guest L4 is done with a per-CPU fixmap entry, that however requires that the currently loaded L4 has the per-CPU slot setup. In order to ensure this switch to the shadow per-CPU L4 with just the Xen slots populated, and then map the guest L4 and copy the contents of the guest controlled slots. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- xen/arch/x86/flushtlb.c | 22 +++++++++++++++++ xen/arch/x86/include/asm/config.h | 6 +++++ xen/arch/x86/include/asm/domain.h | 3 +++ xen/arch/x86/include/asm/pv/mm.h | 5 ++++ xen/arch/x86/mm.c | 12 +++++++++- xen/arch/x86/mm/paging.c | 6 +++++ xen/arch/x86/pv/dom0_build.c | 10 ++++++-- xen/arch/x86/pv/domain.c | 31 +++++++++++++++++++++++- xen/arch/x86/pv/mm.c | 40 +++++++++++++++++++++++++++++++ 9 files changed, 131 insertions(+), 4 deletions(-) diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/flushtlb.c +++ b/xen/arch/x86/flushtlb.c @@ -XXX,XX +XXX,XX @@ #include <asm/nops.h> #include <asm/page.h> #include <asm/pv/domain.h> +#include <asm/pv/mm.h> #include <asm/spec_ctrl.h> /* Debug builds: Wrap frequently to stress-test the wrap logic. */ @@ -XXX,XX +XXX,XX @@ unsigned int flush_area_local(const void *va, unsigned int flags) unsigned int order = (flags - 1) & FLUSH_ORDER_MASK; if ( flags & FLUSH_ROOT_PGTBL ) + { get_cpu_info()->root_pgt_changed = true; + /* + * Use opt_vcpu_pt_pv instead of current->arch.vcpu_pt to avoid doing a + * sync_local_execstate() when per-vCPU page-tables are not enabled for + * PV. + */ + if ( opt_vcpu_pt_pv ) + { + const struct vcpu *curr; + const struct domain *curr_d; + + sync_local_execstate(); + + curr = current; + curr_d = curr->domain; + + if ( is_pv_domain(curr_d) && curr_d->arch.vcpu_pt ) + /* Update shadow root page-table ahead of doing TLB flush. */ + pv_asi_update_shadow_l4(curr); + } + } if ( flags & (FLUSH_TLB|FLUSH_TLB_GLOBAL) ) { diff --git a/xen/arch/x86/include/asm/config.h b/xen/arch/x86/include/asm/config.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/config.h +++ b/xen/arch/x86/include/asm/config.h @@ -XXX,XX +XXX,XX @@ extern unsigned long xen_phys_start; /* The address of a particular VCPU's GDT or LDT. */ #define GDT_VIRT_START(v) \ (PERDOMAIN_VIRT_START + ((v)->vcpu_id << GDT_LDT_VCPU_VA_SHIFT)) +/* + * There are 2 GDT pages reserved for Xen, but only one is used. Use the + * remaining one to map the guest L4 when running with ASI enabled. + */ +#define L4_SHADOW(v) \ + (GDT_VIRT_START(v) + ((FIRST_RESERVED_GDT_PAGE + 1) << PAGE_SHIFT)) #define LDT_VIRT_START(v) \ (GDT_VIRT_START(v) + (64*1024)) diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/domain.h +++ b/xen/arch/x86/include/asm/domain.h @@ -XXX,XX +XXX,XX @@ struct pv_vcpu /* Deferred VA-based update state. 
*/ bool need_update_runstate_area; struct vcpu_time_info pending_system_time; + + /* For ASI: page to use as L4 shadow of the guest selected L4. */ + root_pgentry_t *root_pgt; }; struct arch_vcpu diff --git a/xen/arch/x86/include/asm/pv/mm.h b/xen/arch/x86/include/asm/pv/mm.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/pv/mm.h +++ b/xen/arch/x86/include/asm/pv/mm.h @@ -XXX,XX +XXX,XX @@ bool pv_destroy_ldt(struct vcpu *v); int validate_segdesc_page(struct page_info *page); +void pv_asi_update_shadow_l4(const struct vcpu *v); + #else #include <xen/errno.h> @@ -XXX,XX +XXX,XX @@ static inline bool pv_map_ldt_shadow_page(unsigned int off) { return false; } static inline bool pv_destroy_ldt(struct vcpu *v) { ASSERT_UNREACHABLE(); return false; } +static inline void pv_asi_update_shadow_l4(const struct vcpu *v) +{ ASSERT_UNREACHABLE(); } + #endif #endif /* __X86_PV_MM_H__ */ diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ void write_ptbase(struct vcpu *v) } else { + if ( is_pv_domain(d) && d->arch.vcpu_pt ) + pv_asi_update_shadow_l4(v); /* Make sure to clear use_pv_cr3 and xen_cr3 before pv_cr3. */ cpu_info->use_pv_cr3 = false; cpu_info->xen_cr3 = 0; @@ -XXX,XX +XXX,XX @@ void write_ptbase(struct vcpu *v) */ pagetable_t update_cr3(struct vcpu *v) { + const struct domain *d = v->domain; mfn_t cr3_mfn; if ( paging_mode_enabled(v->domain) ) @@ -XXX,XX +XXX,XX @@ pagetable_t update_cr3(struct vcpu *v) else cr3_mfn = pagetable_get_mfn(v->arch.guest_table); - make_cr3(v, cr3_mfn); + make_cr3(v, d->arch.vcpu_pt ? virt_to_mfn(v->arch.pv.root_pgt) : cr3_mfn); + + if ( d->arch.vcpu_pt ) + { + populate_perdomain_mapping(v, L4_SHADOW(v), &cr3_mfn, 1); + if ( v == this_cpu(curr_vcpu) ) + flush_tlb_one_local(L4_SHADOW(v)); + } return pagetable_null(); } diff --git a/xen/arch/x86/mm/paging.c b/xen/arch/x86/mm/paging.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm/paging.c +++ b/xen/arch/x86/mm/paging.c @@ -XXX,XX +XXX,XX @@ int paging_domctl(struct domain *d, struct xen_domctl_shadow_op *sc, return -EINVAL; } + if ( is_pv_domain(d) && d->arch.vcpu_pt ) + { + gprintk(XENLOG_ERR, "Paging not supported on PV domains with ASI\n"); + return -EOPNOTSUPP; + } + if ( resuming ? (d->arch.paging.preempt.dom != current->domain || d->arch.paging.preempt.op != sc->op) diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/dom0_build.c +++ b/xen/arch/x86/pv/dom0_build.c @@ -XXX,XX +XXX,XX @@ static int __init dom0_construct(struct boot_info *bi, struct domain *d) d->arch.paging.mode = 0; - /* Set up CR3 value for switch_cr3_cr4(). */ - update_cr3(v); + /* + * Set up CR3 value for switch_cr3_cr4(). Use make_cr3() instead of + * update_cr3() to avoid using an ASI page-table for dom0 building. + */ + make_cr3(v, pagetable_get_mfn(v->arch.guest_table)); /* We run on dom0's page tables for the final part of the build process. */ switch_cr3_cr4(cr3_pa(v->arch.cr3), read_cr4()); @@ -XXX,XX +XXX,XX @@ static int __init dom0_construct(struct boot_info *bi, struct domain *d) } #endif + /* Must be called in case ASI is enabled. 
*/ + update_cr3(v); + v->is_initialised = 1; clear_bit(_VPF_down, &v->pause_flags); diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/domain.c +++ b/xen/arch/x86/pv/domain.c @@ -XXX,XX +XXX,XX @@ #include <asm/invpcid.h> #include <asm/spec_ctrl.h> #include <asm/pv/domain.h> +#include <asm/pv/mm.h> #include <asm/shadow.h> #ifdef CONFIG_PV32 @@ -XXX,XX +XXX,XX @@ void pv_vcpu_destroy(struct vcpu *v) pv_destroy_gdt_ldt_l1tab(v); XFREE(v->arch.pv.trap_ctxt); + FREE_XENHEAP_PAGE(v->arch.pv.root_pgt); } int pv_vcpu_initialise(struct vcpu *v) @@ -XXX,XX +XXX,XX @@ int pv_vcpu_initialise(struct vcpu *v) goto done; } + if ( d->arch.vcpu_pt ) + { + v->arch.pv.root_pgt = alloc_xenheap_page(); + if ( !v->arch.pv.root_pgt ) + { + rc = -ENOMEM; + goto done; + } + + /* + * VM assists are not yet known, RO machine-to-phys slot will be copied + * from the guest L4. + */ + init_xen_l4_slots(v->arch.pv.root_pgt, + _mfn(virt_to_mfn(v->arch.pv.root_pgt)), + v, INVALID_MFN, false); + } + done: if ( rc ) pv_vcpu_destroy(v); @@ -XXX,XX +XXX,XX @@ int pv_domain_initialise(struct domain *d) d->arch.ctxt_switch = &pv_csw; - d->arch.pv.flush_root_pt = d->arch.pv.xpti; + d->arch.pv.flush_root_pt = d->arch.pv.xpti || d->arch.vcpu_pt; if ( !is_pv_32bit_domain(d) && use_invpcid && cpu_has_pcid ) switch ( ACCESS_ONCE(opt_pcid) ) @@ -XXX,XX +XXX,XX @@ bool __init xpti_pcid_enabled(void) static void _toggle_guest_pt(struct vcpu *v) { + const struct domain *d = v->domain; bool guest_update; pagetable_t old_shadow; unsigned long cr3; @@ -XXX,XX +XXX,XX @@ static void _toggle_guest_pt(struct vcpu *v) guest_update = v->arch.flags & TF_kernel_mode; old_shadow = update_cr3(v); + if ( d->arch.vcpu_pt ) + /* + * _toggle_guest_pt() might switch between user and kernel page tables, + * but doesn't use write_ptbase(), and hence needs an explicit call to + * sync the shadow L4. + */ + pv_asi_update_shadow_l4(v); + /* * Don't flush user global mappings from the TLB. Don't tick TLB clock. * diff --git a/xen/arch/x86/pv/mm.c b/xen/arch/x86/pv/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/pv/mm.c +++ b/xen/arch/x86/pv/mm.c @@ -XXX,XX +XXX,XX @@ #include <asm/current.h> #include <asm/p2m.h> +#include <asm/pv/domain.h> #include "mm.h" @@ -XXX,XX +XXX,XX @@ void init_xen_pae_l2_slots(l2_pgentry_t *l2t, const struct domain *d) } #endif +void pv_asi_update_shadow_l4(const struct vcpu *v) +{ + const root_pgentry_t *guest_pgt; + root_pgentry_t *root_pgt = v->arch.pv.root_pgt; + const struct domain *d = v->domain; + + ASSERT(!d->arch.pv.xpti); + ASSERT(is_pv_domain(d)); + ASSERT(!is_idle_domain(d)); + ASSERT(current == this_cpu(curr_vcpu)); + + if ( likely(v == current) ) + guest_pgt = (void *)L4_SHADOW(v); + else if ( !(v->arch.flags & TF_kernel_mode) ) + guest_pgt = + map_domain_page(pagetable_get_mfn(v->arch.guest_table_user)); + else + guest_pgt = map_domain_page(pagetable_get_mfn(v->arch.guest_table)); + + if ( is_pv_64bit_domain(d) ) + { + unsigned int i; + + for ( i = 0; i < ROOT_PAGETABLE_FIRST_XEN_SLOT; i++ ) + l4e_write(&root_pgt[i], guest_pgt[i]); + for ( i = ROOT_PAGETABLE_LAST_XEN_SLOT + 1; + i < L4_PAGETABLE_ENTRIES; i++ ) + l4e_write(&root_pgt[i], guest_pgt[i]); + + l4e_write(&root_pgt[l4_table_offset(RO_MPT_VIRT_START)], + guest_pgt[l4_table_offset(RO_MPT_VIRT_START)]); + } + else + l4e_write(&root_pgt[0], guest_pgt[0]); + + if ( v != this_cpu(curr_vcpu) ) + unmap_domain_page(guest_pgt); +} + /* * Local variables: * mode: C -- 2.46.0
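For 64-bit PV guests the synchronisation done by pv_asi_update_shadow_l4() amounts to copying every L4 slot outside the Xen-owned range, plus honouring the guest's read-only M2P slot. A minimal sketch, condensed from the hunk above (sync_guest_slots() is illustrative only):

    /* Illustrative: guest-controlled L4 slots copied into the shadow L4. */
    static void sync_guest_slots(root_pgentry_t *shadow,
                                 const root_pgentry_t *guest)
    {
        unsigned int i;

        for ( i = 0; i < ROOT_PAGETABLE_FIRST_XEN_SLOT; i++ )
            l4e_write(&shadow[i], guest[i]);
        for ( i = ROOT_PAGETABLE_LAST_XEN_SLOT + 1; i < L4_PAGETABLE_ENTRIES; i++ )
            l4e_write(&shadow[i], guest[i]);

        /* The RO machine-to-phys slot also follows the guest's selection. */
        l4e_write(&shadow[l4_table_offset(RO_MPT_VIRT_START)],
                  guest[l4_table_offset(RO_MPT_VIRT_START)]);
    }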
When using ASI the CPU stack is mapped using a range of fixmap entries in the per-CPU region. This ensures the stack is only accessible by the current CPU. Note however there's further work required in order to allocate the stack from domheap instead of xenheap, and ensure the stack is not part of the direct map. For domains not running with ASI enabled all the CPU stacks are mapped in the per-domain L3, so that the stack is always at the same linear address, regardless of whether ASI is enabled or not for the domain. When calling UEFI runtime methods the current per-domain slot needs to be added to the EFI L4, so that the stack is available in UEFI. Finally, some users of callfunc IPIs pass parameters from the stack, so when handling a callfunc IPI the stack of the caller CPU is mapped into the address space of the CPU handling the IPI. This needs further work to use a bounce buffer in order to avoid having to map remote CPU stacks. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- There's also further work required in order to avoid mapping remote stack when handling callfunc IPIs. --- docs/misc/xen-command-line.pandoc | 5 +- xen/arch/x86/domain.c | 30 ++++++++++++ xen/arch/x86/include/asm/config.h | 10 +++- xen/arch/x86/include/asm/current.h | 5 ++ xen/arch/x86/include/asm/domain.h | 3 ++ xen/arch/x86/include/asm/mm.h | 2 +- xen/arch/x86/include/asm/smp.h | 12 +++++ xen/arch/x86/include/asm/spec_ctrl.h | 1 + xen/arch/x86/mm.c | 69 ++++++++++++++++++++++------ xen/arch/x86/setup.c | 32 ++++++++++--- xen/arch/x86/smp.c | 39 ++++++++++++++++ xen/arch/x86/smpboot.c | 20 +++++++- xen/arch/x86/spec_ctrl.c | 67 +++++++++++++++++++++++---- xen/arch/x86/traps.c | 8 +++- xen/common/smp.c | 10 ++++ xen/common/stop_machine.c | 10 ++++ xen/include/xen/smp.h | 8 ++++ 17 files changed, 295 insertions(+), 36 deletions(-) diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc index XXXXXXX..XXXXXXX 100644 --- a/docs/misc/xen-command-line.pandoc +++ b/docs/misc/xen-command-line.pandoc @@ -XXX,XX +XXX,XX @@ to appropriate auditing by Xen. Argo is disabled by default. ### asi (x86) > `= List of [ <bool>, {pv,hvm}=<bool>, - {vcpu-pt}=<bool>|{pv,hvm}=<bool> ]` + {vcpu-pt,cpu-stack}=<bool>|{pv,hvm}=<bool> ]` Offers control over whether the hypervisor will engage in Address Space Isolation, by not having potentially sensitive information permanently mapped @@ -XXX,XX +XXX,XX @@ meant to be used for debugging purposes only.** * `vcpu-pt` ensure each vCPU uses a unique top-level page-table and setup a virtual address space region to map memory on a per-vCPU basis. +* `cpu-stack` prevent CPUs from having permanent mappings of stacks different + than their own. Depends on the `vcpu-pt` option. 
+ ### asid (x86) > `= <boolean>` diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -XXX,XX +XXX,XX @@ int arch_vcpu_create(struct vcpu *v) if ( rc ) return rc; + if ( opt_cpu_stack_hvm || opt_cpu_stack_pv ) + { + if ( is_idle_vcpu(v) || d->arch.cpu_stack ) + create_perdomain_mapping(v, PCPU_STACK_VIRT(0), + nr_cpu_ids << STACK_ORDER, false); + else if ( !v->vcpu_id ) + { + l3_pgentry_t *idle_perdomain = + __map_domain_page(idle_vcpu[0]->domain->arch.perdomain_l3_pg); + l3_pgentry_t *guest_perdomain = + __map_domain_page(d->arch.perdomain_l3_pg); + + l3e_write(&guest_perdomain[PCPU_STACK_SLOT], + idle_perdomain[PCPU_STACK_SLOT]); + + unmap_domain_page(guest_perdomain); + unmap_domain_page(idle_perdomain); + } + } + rc = mapcache_vcpu_init(v); if ( rc ) return rc; @@ -XXX,XX +XXX,XX @@ static void __context_switch(struct vcpu *n) } vcpu_restore_fpu_nonlazy(n, false); nd->arch.ctxt_switch->to(n); + if ( nd->arch.cpu_stack ) + { + /* + * Tear down previous stack mappings and map current pCPU stack. + * This is safe because not yet running on 'n' page-tables. + */ + destroy_perdomain_mapping(n, PCPU_STACK_VIRT(0), + nr_cpu_ids << STACK_ORDER); + vcpu_set_stack_mappings(n, cpu, true); + } } psr_ctxt_switch_to(nd); diff --git a/xen/arch/x86/include/asm/config.h b/xen/arch/x86/include/asm/config.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/config.h +++ b/xen/arch/x86/include/asm/config.h @@ -XXX,XX +XXX,XX @@ /* Slot 260: per-domain mappings (including map cache). */ #define PERDOMAIN_VIRT_START (PML4_ADDR(260)) #define PERDOMAIN_SLOT_MBYTES (PML4_ENTRY_BYTES >> (20 + PAGETABLE_ORDER)) -#define PERDOMAIN_SLOTS 3 +#define PERDOMAIN_SLOTS 4 #define PERDOMAIN_VIRT_SLOT(s) (PERDOMAIN_VIRT_START + (s) * \ (PERDOMAIN_SLOT_MBYTES << 20)) /* Slot 4: mirror of per-domain mappings (for compat xlat area accesses). */ @@ -XXX,XX +XXX,XX @@ extern unsigned long xen_phys_start; #define ARG_XLAT_START(v) \ (ARG_XLAT_VIRT_START + ((v)->vcpu_id << ARG_XLAT_VA_SHIFT)) +/* Per-CPU stacks area when using ASI. */ +#define PCPU_STACK_SLOT 3 +#define PCPU_STACK_VIRT_START PERDOMAIN_VIRT_SLOT(PCPU_STACK_SLOT) +#define PCPU_STACK_VIRT_END (PCPU_STACK_VIRT_START + \ + (PERDOMAIN_SLOT_MBYTES << 20)) +#define PCPU_STACK_VIRT(cpu) (PCPU_STACK_VIRT_START + \ + (cpu << STACK_ORDER) * PAGE_SIZE) + #define ELFSIZE 64 #define ARCH_CRASH_SAVE_VMCOREINFO diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/current.h +++ b/xen/arch/x86/include/asm/current.h @@ -XXX,XX +XXX,XX @@ * 0 - IST Shadow Stacks (4x 1k, read-only) */ +static inline bool is_shstk_slot(unsigned int i) +{ + return (i == 0 || i == PRIMARY_SHSTK_SLOT); +} + /* * Identify which stack page the stack pointer is on. Returns an index * as per the comment above. diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/domain.h +++ b/xen/arch/x86/include/asm/domain.h @@ -XXX,XX +XXX,XX @@ struct arch_domain /* Use a per-vCPU root pt, and switch per-domain slot to per-vCPU. */ bool vcpu_pt; + /* Use per-CPU mapped stacks. */ + bool cpu_stack; + /* Emulated devices enabled bitmap. 
*/ uint32_t emulation_flags; } __cacheline_aligned; diff --git a/xen/arch/x86/include/asm/mm.h b/xen/arch/x86/include/asm/mm.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/mm.h +++ b/xen/arch/x86/include/asm/mm.h @@ -XXX,XX +XXX,XX @@ extern struct rangeset *mmio_ro_ranges; #define compat_pfn_to_cr3(pfn) (((unsigned)(pfn) << 12) | ((unsigned)(pfn) >> 20)) #define compat_cr3_to_pfn(cr3) (((unsigned)(cr3) >> 12) | ((unsigned)(cr3) << 20)) -void memguard_guard_stack(void *p); +void memguard_guard_stack(void *p, unsigned int cpu); void memguard_unguard_stack(void *p); /* diff --git a/xen/arch/x86/include/asm/smp.h b/xen/arch/x86/include/asm/smp.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/smp.h +++ b/xen/arch/x86/include/asm/smp.h @@ -XXX,XX +XXX,XX @@ extern bool unaccounted_cpus; void *cpu_alloc_stack(unsigned int cpu); +/* + * Setup the per-CPU area stack mappings. + * + * @v: vCPU where the mappings are to appear. + * @stack_cpu: CPU whose stacks should be mapped. + * @map_shstk: create mappings for shadow stack regions. + */ +void vcpu_set_stack_mappings(const struct vcpu *v, unsigned int stack_cpu, + bool map_shstk); + +#define HAS_ARCH_SMP_CALLFUNC_PREAMBLE + #endif /* !__ASSEMBLY__ */ #endif diff --git a/xen/arch/x86/include/asm/spec_ctrl.h b/xen/arch/x86/include/asm/spec_ctrl.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/spec_ctrl.h +++ b/xen/arch/x86/include/asm/spec_ctrl.h @@ -XXX,XX +XXX,XX @@ extern uint8_t default_scf; extern int8_t opt_xpti_hwdom, opt_xpti_domu; extern int8_t opt_vcpu_pt_pv, opt_vcpu_pt_hwdom, opt_vcpu_pt_hvm; +extern int8_t opt_cpu_stack_pv, opt_cpu_stack_hwdom, opt_cpu_stack_hvm; extern bool cpu_has_bug_l1tf; extern int8_t opt_pv_l1tf_hwdom, opt_pv_l1tf_domu; diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -XXX,XX +XXX,XX @@ * doing the final put_page(), and remove it from the iommu if so. 
*/ +#include <xen/cpu.h> #include <xen/init.h> #include <xen/ioreq.h> #include <xen/kernel.h> @@ -XXX,XX +XXX,XX @@ int create_perdomain_mapping(struct vcpu *v, unsigned long va, return rc; } -void populate_perdomain_mapping(const struct vcpu *v, unsigned long va, - mfn_t *mfn, unsigned long nr) +static void populate_perdomain_mapping_flags(const struct vcpu *v, + unsigned long va, mfn_t *mfn, + unsigned long nr, + unsigned int flags) { l1_pgentry_t *l1tab = NULL, *pl1e; const l3_pgentry_t *l3tab; @@ -XXX,XX +XXX,XX @@ void populate_perdomain_mapping(const struct vcpu *v, unsigned long va, ASSERT_UNREACHABLE(); free_domheap_page(l1e_get_page(*pl1e)); } - l1e_write(pl1e, l1e_from_mfn(mfn[i], __PAGE_HYPERVISOR_RW)); + l1e_write(pl1e, l1e_from_mfn(mfn[i], flags)); } return; @@ -XXX,XX +XXX,XX @@ void populate_perdomain_mapping(const struct vcpu *v, unsigned long va, free_domheap_page(l1e_get_page(*pl1e)); } - l1e_write(pl1e, l1e_from_mfn(*mfn, __PAGE_HYPERVISOR_RW)); + l1e_write(pl1e, l1e_from_mfn(*mfn, flags)); } unmap_domain_page(l1tab); @@ -XXX,XX +XXX,XX @@ void populate_perdomain_mapping(const struct vcpu *v, unsigned long va, unmap_domain_page(l3tab); } +void populate_perdomain_mapping(const struct vcpu *v, unsigned long va, + mfn_t *mfn, unsigned long nr) +{ + populate_perdomain_mapping_flags(v, va, mfn, nr, __PAGE_HYPERVISOR_RW); +} + +void vcpu_set_stack_mappings(const struct vcpu *v, unsigned int stack_cpu, + bool map_shstk) +{ + unsigned int i; + + for ( i = 0; i < (1U << STACK_ORDER); i++ ) + { + unsigned int flags = is_shstk_slot(i) ? __PAGE_HYPERVISOR_SHSTK + : __PAGE_HYPERVISOR_RW; + mfn_t mfn = virt_to_mfn(stack_base[stack_cpu] + i * PAGE_SIZE); + + if ( is_shstk_slot(i) && !map_shstk ) + continue; + + populate_perdomain_mapping_flags(v, + PCPU_STACK_VIRT(stack_cpu) + i * PAGE_SIZE, &mfn, 1, flags); + } +} + void destroy_perdomain_mapping(const struct vcpu *v, unsigned long va, unsigned int nr) { @@ -XXX,XX +XXX,XX @@ void free_perdomain_mappings(struct vcpu *v) l3tab = __map_domain_page(d->arch.vcpu_pt ? v->arch.pervcpu_l3_pg : d->arch.perdomain_l3_pg); - for ( i = 0; i < PERDOMAIN_SLOTS; ++i) + for ( i = 0; i < PERDOMAIN_SLOTS; ++i ) + { + if ( i == PCPU_STACK_SLOT && !d->arch.cpu_stack ) + /* Without ASI the stack L3e is shared with the idle page-tables. */ + continue; + if ( l3e_get_flags(l3tab[i]) & _PAGE_PRESENT ) { struct page_info *l2pg = l3e_get_page(l3tab[i]); @@ -XXX,XX +XXX,XX @@ void free_perdomain_mappings(struct vcpu *v) unmap_domain_page(l2tab); free_domheap_page(l2pg); } + } unmap_domain_page(l3tab); free_domheap_page(d->arch.vcpu_pt ? v->arch.pervcpu_l3_pg @@ -XXX,XX +XXX,XX @@ void free_perdomain_mappings(struct vcpu *v) v->arch.pervcpu_l3_pg = NULL; } -static void write_sss_token(unsigned long *ptr) +static void write_sss_token(unsigned long *ptr, unsigned long va) { /* * A supervisor shadow stack token is its own linear address, with the * busy bit (0) clear. */ - *ptr = (unsigned long)ptr; + *ptr = va; } -void memguard_guard_stack(void *p) +void memguard_guard_stack(void *p, unsigned int cpu) { + unsigned long va = + (opt_cpu_stack_hvm || opt_cpu_stack_pv) ? PCPU_STACK_VIRT(cpu) + : (unsigned long)p; + /* IST Shadow stacks. 4x 1k in stack page 0. 
*/ if ( IS_ENABLED(CONFIG_XEN_SHSTK) ) { - write_sss_token(p + (IST_MCE * IST_SHSTK_SIZE) - 8); - write_sss_token(p + (IST_NMI * IST_SHSTK_SIZE) - 8); - write_sss_token(p + (IST_DB * IST_SHSTK_SIZE) - 8); - write_sss_token(p + (IST_DF * IST_SHSTK_SIZE) - 8); + write_sss_token(p + (IST_MCE * IST_SHSTK_SIZE) - 8, + va + (IST_MCE * IST_SHSTK_SIZE) - 8); + write_sss_token(p + (IST_NMI * IST_SHSTK_SIZE) - 8, + va + (IST_NMI * IST_SHSTK_SIZE) - 8); + write_sss_token(p + (IST_DB * IST_SHSTK_SIZE) - 8, + va + (IST_DB * IST_SHSTK_SIZE) - 8); + write_sss_token(p + (IST_DF * IST_SHSTK_SIZE) - 8, + va + (IST_DF * IST_SHSTK_SIZE) - 8); } map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, PAGE_HYPERVISOR_SHSTK); /* Primary Shadow Stack. 1x 4k in stack page 5. */ p += PRIMARY_SHSTK_SLOT * PAGE_SIZE; + va += PRIMARY_SHSTK_SLOT * PAGE_SIZE; if ( IS_ENABLED(CONFIG_XEN_SHSTK) ) - write_sss_token(p + PAGE_SIZE - 8); + write_sss_token(p + PAGE_SIZE - 8, va + PAGE_SIZE - 8); map_pages_to_xen((unsigned long)p, virt_to_mfn(p), 1, PAGE_HYPERVISOR_SHSTK); } diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/setup.c +++ b/xen/arch/x86/setup.c @@ -XXX,XX +XXX,XX @@ static void __init init_idle_domain(void) scheduler_init(); set_current(idle_vcpu[0]); this_cpu(curr_vcpu) = current; + if ( opt_cpu_stack_hvm || opt_cpu_stack_pv ) + /* Set per-domain slot in the idle page-tables to access stack mappings. */ + l4e_write(&idle_pg_table[l4_table_offset(PERDOMAIN_VIRT_START)], + l4e_from_page(idle_vcpu[0]->domain->arch.perdomain_l3_pg, + __PAGE_HYPERVISOR_RW)); } void srat_detect_node(int cpu) @@ -XXX,XX +XXX,XX @@ static void __init noreturn reinit_bsp_stack(void) /* Update SYSCALL trampolines */ percpu_traps_init(); - stack_base[0] = stack; - rc = setup_cpu_root_pgt(0); if ( rc ) panic("Error %d setting up PV root page table\n", rc); @@ -XXX,XX +XXX,XX @@ void asmlinkage __init noreturn __start_xen(void) system_state = SYS_STATE_boot; - bsp_stack = cpu_alloc_stack(0); - if ( !bsp_stack ) - panic("No memory for BSP stack\n"); - console_init_ring(); vesa_init(); @@ -XXX,XX +XXX,XX @@ void asmlinkage __init noreturn __start_xen(void) alternative_branches(); + /* + * Alloc the BSP stack closer to the point where the AP ones also get + * allocated - and after the speculation mitigations have been initialized. + * In order to set up the shadow stack token correctly Xen needs to know + * whether per-CPU mapped stacks are being used. + */ + bsp_stack = cpu_alloc_stack(0); + if ( !bsp_stack ) + panic("No memory for BSP stack\n"); + /* * NB: when running as a PV shim VCPUOP_up/down is wired to the shim * physical cpu_add/remove functions, so launch the guest with only @@ -XXX,XX +XXX,XX @@ void asmlinkage __init noreturn __start_xen(void) info->last_spec_ctrl = default_xen_spec_ctrl; } + stack_base[0] = bsp_stack; + /* Copy the cpu info block, and move onto the BSP stack. 
*/ - bsp_info = get_cpu_info_from_stack((unsigned long)bsp_stack); + if ( opt_cpu_stack_hvm || opt_cpu_stack_pv ) + { + vcpu_set_stack_mappings(idle_vcpu[0], 0, true); + bsp_info = get_cpu_info_from_stack(PCPU_STACK_VIRT(0)); + } + else + bsp_info = get_cpu_info_from_stack((unsigned long)bsp_stack); + *bsp_info = *info; asm volatile ("mov %[stk], %%rsp; jmp %c[fn]" :: diff --git a/xen/arch/x86/smp.c b/xen/arch/x86/smp.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/smp.c +++ b/xen/arch/x86/smp.c @@ -XXX,XX +XXX,XX @@ */ #include <xen/cpu.h> +#include <xen/efi.h> #include <xen/irq.h> #include <xen/sched.h> #include <xen/delay.h> @@ -XXX,XX +XXX,XX @@ #include <asm/hpet.h> #include <asm/setup.h> +#include <asm/spec_ctrl.h> + /* Helper functions to prepare APIC register values. */ static unsigned int prepare_ICR(unsigned int shortcut, int vector) { @@ -XXX,XX +XXX,XX @@ long cf_check cpu_down_helper(void *data) ret = cpu_down(cpu); return ret; } + +void arch_smp_pre_callfunc(unsigned int cpu) +{ + if ( !opt_cpu_stack_hvm && !opt_cpu_stack_pv ) + /* + * Avoid the unconditional sync_local_execstate() call below if ASI is + * not enabled for any domain. + */ + return; + + /* + * Sync execution state, so that the page-tables cannot change while + * creating or destroying the stack mappings. + */ + sync_local_execstate(); + if ( cpu == smp_processor_id() || !current->domain->arch.cpu_stack || + /* EFI page-tables have all pCPU stacks mapped. */ + efi_rs_using_pgtables() ) + return; + + vcpu_set_stack_mappings(current, cpu, false); +} + +void arch_smp_post_callfunc(unsigned int cpu) +{ + if ( cpu == smp_processor_id() || !current->domain->arch.cpu_stack || + /* EFI page-tables have all pCPU stacks mapped. */ + efi_rs_using_pgtables() ) + return; + + ASSERT(current == this_cpu(curr_vcpu)); + destroy_perdomain_mapping(current, PCPU_STACK_VIRT(cpu), + (1U << STACK_ORDER)); + + flush_area_local((void *)PCPU_STACK_VIRT(cpu), FLUSH_ORDER(STACK_ORDER)); +} diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/smpboot.c +++ b/xen/arch/x86/smpboot.c @@ -XXX,XX +XXX,XX @@ static int do_boot_cpu(int apicid, int cpu) printk("Booting processor %d/%d eip %lx\n", cpu, apicid, start_eip); - stack_start = stack_base[cpu] + STACK_SIZE - sizeof(struct cpu_info); + if ( opt_cpu_stack_hvm || opt_cpu_stack_pv ) + { + /* + * Uniformly run with the stack mappings in the per-domain area if ASI + * is enabled for any domain type. + */ + vcpu_set_stack_mappings(idle_vcpu[cpu], cpu, true); + + ASSERT(IS_ALIGNED(PCPU_STACK_VIRT(cpu), STACK_SIZE)); + + stack_start = (void *)PCPU_STACK_VIRT(cpu) + STACK_SIZE - + sizeof(struct cpu_info); + } + else + stack_start = stack_base[cpu] + STACK_SIZE - sizeof(struct cpu_info); /* This grunge runs the startup process for the targeted processor. 
*/ @@ -XXX,XX +XXX,XX @@ void *cpu_alloc_stack(unsigned int cpu) stack = alloc_xenheap_pages(STACK_ORDER, memflags); if ( stack ) - memguard_guard_stack(stack); + memguard_guard_stack(stack, cpu); return stack; } @@ -XXX,XX +XXX,XX @@ static struct notifier_block cpu_smpboot_nfb = { void __init smp_prepare_cpus(void) { + BUILD_BUG_ON(PCPU_STACK_VIRT(CONFIG_NR_CPUS) > PCPU_STACK_VIRT_END); + register_cpu_notifier(&cpu_smpboot_nfb); mtrr_aps_sync_begin(); diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/spec_ctrl.c +++ b/xen/arch/x86/spec_ctrl.c @@ -XXX,XX +XXX,XX @@ bool __ro_after_init opt_bp_spec_reduce = true; int8_t __ro_after_init opt_vcpu_pt_hvm = -1; int8_t __ro_after_init opt_vcpu_pt_hwdom = -1; int8_t __ro_after_init opt_vcpu_pt_pv = -1; +/* Per-CPU stacks. */ +int8_t __ro_after_init opt_cpu_stack_hvm = -1; +int8_t __ro_after_init opt_cpu_stack_hwdom = -1; +int8_t __ro_after_init opt_cpu_stack_pv = -1; static int __init cf_check parse_spec_ctrl(const char *s) { @@ -XXX,XX +XXX,XX @@ static __init void xpti_init_default(void) printk(XENLOG_ERR "XPTI incompatible with per-vCPU page-tables, disabling ASI\n"); opt_vcpu_pt_pv = 0; + opt_cpu_stack_pv = 0; } if ( (boot_cpu_data.x86_vendor & (X86_VENDOR_AMD | X86_VENDOR_HYGON)) || cpu_has_rdcl_no ) @@ -XXX,XX +XXX,XX @@ static int __init cf_check parse_asi(const char *s) /* Interpret 'asi' alone in its positive boolean form. */ if ( *s == '\0' ) + { opt_vcpu_pt_pv = opt_vcpu_pt_hwdom = opt_vcpu_pt_hvm = 1; + opt_cpu_stack_pv = opt_cpu_stack_hwdom = opt_cpu_stack_hvm = 1; + } do { ss = strchr(s, ','); @@ -XXX,XX +XXX,XX @@ static int __init cf_check parse_asi(const char *s) case 0: case 1: opt_vcpu_pt_pv = opt_vcpu_pt_hwdom = opt_vcpu_pt_hvm = val; + opt_cpu_stack_pv = opt_cpu_stack_hvm = opt_cpu_stack_hwdom = val; break; default: if ( (val = parse_boolean("pv", s, ss)) >= 0 ) - opt_vcpu_pt_pv = val; + opt_cpu_stack_pv = opt_vcpu_pt_pv = val; else if ( (val = parse_boolean("hvm", s, ss)) >= 0 ) - opt_vcpu_pt_hvm = val; + opt_cpu_stack_hvm = opt_vcpu_pt_hvm = val; else if ( (val = parse_boolean("vcpu-pt", s, ss)) != -1 ) { switch ( val ) @@ -XXX,XX +XXX,XX @@ static int __init cf_check parse_asi(const char *s) break; } } + else if ( (val = parse_boolean("cpu-stack", s, ss)) != -1 ) + { + switch ( val ) + { + case 1: + case 0: + opt_cpu_stack_pv = opt_cpu_stack_hvm = + opt_cpu_stack_hwdom = val; + break; + + case -2: + s += strlen("cpu-stack="); + if ( (val = parse_boolean("pv", s, ss)) >= 0 ) + opt_cpu_stack_pv = val; + else if ( (val = parse_boolean("hvm", s, ss)) >= 0 ) + opt_cpu_stack_hvm = val; + else + default: + rc = -EINVAL; + break; + } + } else if ( *s ) rc = -EINVAL; break; @@ -XXX,XX +XXX,XX @@ static int __init cf_check parse_asi(const char *s) s = ss + 1; } while ( *ss ); + /* Per-CPU stacks depends on per-vCPU mappings. */ + if ( opt_cpu_stack_pv == 1 ) + opt_vcpu_pt_pv = 1; + if ( opt_cpu_stack_hvm == 1 ) + opt_vcpu_pt_hvm = 1; + if ( opt_cpu_stack_hwdom == 1 ) + opt_vcpu_pt_hwdom = 1; + return rc; } custom_param("asi", parse_asi); @@ -XXX,XX +XXX,XX @@ static void __init print_details(enum ind_thunk thunk) #endif #ifdef CONFIG_HVM - printk(" ASI features for HVM VMs:%s%s\n", - opt_vcpu_pt_hvm ? "" : " None", - opt_vcpu_pt_hvm ? " vCPU-PT" : ""); + printk(" ASI features for HVM VMs:%s%s%s\n", + opt_vcpu_pt_hvm || opt_cpu_stack_hvm ? "" : " None", + opt_vcpu_pt_hvm ? " vCPU-PT" : "", + opt_cpu_stack_hvm ? 
" CPU-STACK" : ""); #endif #ifdef CONFIG_PV - printk(" ASI features for PV VMs:%s%s\n", - opt_vcpu_pt_pv ? "" : " None", - opt_vcpu_pt_pv ? " vCPU-PT" : ""); - + printk(" ASI features for PV VMs:%s%s%s\n", + opt_vcpu_pt_pv || opt_cpu_stack_pv ? "" : " None", + opt_vcpu_pt_pv ? " vCPU-PT" : "", + opt_cpu_stack_pv ? " CPU-STACK" : ""); #endif } @@ -XXX,XX +XXX,XX @@ void spec_ctrl_init_domain(struct domain *d) d->arch.vcpu_pt = is_hardware_domain(d) ? opt_vcpu_pt_hwdom : pv ? opt_vcpu_pt_pv : opt_vcpu_pt_hvm; + d->arch.cpu_stack = is_hardware_domain(d) ? opt_cpu_stack_hwdom + : pv ? opt_cpu_stack_pv + : opt_cpu_stack_hvm; } void __init init_speculation_mitigations(void) @@ -XXX,XX +XXX,XX @@ void __init init_speculation_mitigations(void) opt_vcpu_pt_hwdom = 0; if ( opt_vcpu_pt_hvm == -1 ) opt_vcpu_pt_hvm = 0; + if ( opt_cpu_stack_pv == -1 ) + opt_cpu_stack_pv = 0; + if ( opt_cpu_stack_hwdom == -1 ) + opt_cpu_stack_hwdom = 0; + if ( opt_cpu_stack_hvm == -1 ) + opt_cpu_stack_hvm = 0; if ( opt_vcpu_pt_pv || opt_vcpu_pt_hvm ) warning_add( diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/traps.c +++ b/xen/arch/x86/traps.c @@ -XXX,XX +XXX,XX @@ #include <asm/pv/trace.h> #include <asm/pv/mm.h> #include <asm/shstk.h> +#include <asm/spec_ctrl.h> /* * opt_nmi: one of 'ignore', 'dom0', or 'fatal'. @@ -XXX,XX +XXX,XX @@ void show_stack_overflow(unsigned int cpu, const struct cpu_user_regs *regs) unsigned long esp = regs->rsp; unsigned long curr_stack_base = esp & ~(STACK_SIZE - 1); unsigned long esp_top, esp_bottom; + const void *stack = + (opt_cpu_stack_hvm || opt_cpu_stack_pv) ? (void *)PCPU_STACK_VIRT(cpu) + : stack_base[cpu]; - if ( _p(curr_stack_base) != stack_base[cpu] ) + if ( _p(curr_stack_base) != stack ) printk("Current stack base %p differs from expected %p\n", - _p(curr_stack_base), stack_base[cpu]); + _p(curr_stack_base), stack); esp_bottom = (esp | (STACK_SIZE - 1)) + 1; esp_top = esp_bottom - PRIMARY_STACK_SIZE; diff --git a/xen/common/smp.c b/xen/common/smp.c index XXXXXXX..XXXXXXX 100644 --- a/xen/common/smp.c +++ b/xen/common/smp.c @@ -XXX,XX +XXX,XX @@ static struct call_data_struct { void (*func) (void *info); void *info; int wait; + unsigned int caller; cpumask_t selected; } call_data; @@ -XXX,XX +XXX,XX @@ void on_selected_cpus( call_data.func = func; call_data.info = info; call_data.wait = wait; + call_data.caller = smp_processor_id(); smp_send_call_function_mask(&call_data.selected); @@ -XXX,XX +XXX,XX @@ void smp_call_function_interrupt(void) if ( !cpumask_test_cpu(cpu, &call_data.selected) ) return; + /* + * TODO: use bounce buffers to pass callfunc data, so that when using ASI + * there's no need to map remote CPU stacks. 
+ */ + arch_smp_pre_callfunc(call_data.caller); + irq_enter(); if ( unlikely(!func) ) @@ -XXX,XX +XXX,XX @@ void smp_call_function_interrupt(void) } irq_exit(); + + arch_smp_post_callfunc(call_data.caller); } /* diff --git a/xen/common/stop_machine.c b/xen/common/stop_machine.c index XXXXXXX..XXXXXXX 100644 --- a/xen/common/stop_machine.c +++ b/xen/common/stop_machine.c @@ -XXX,XX +XXX,XX @@ enum stopmachine_state { struct stopmachine_data { unsigned int nr_cpus; + unsigned int caller; enum stopmachine_state state; atomic_t done; @@ -XXX,XX +XXX,XX @@ int stop_machine_run(int (*fn)(void *data), void *data, unsigned int cpu) stopmachine_data.fn_result = 0; atomic_set(&stopmachine_data.done, 0); stopmachine_data.state = STOPMACHINE_START; + stopmachine_data.caller = this; smp_wmb(); @@ -XXX,XX +XXX,XX @@ static void cf_check stopmachine_action(void *data) BUG_ON(cpu != smp_processor_id()); + /* + * TODO: use bounce buffers to pass callfunc data, so that when using ASI + * there's no need to map remote CPU stacks. + */ + arch_smp_pre_callfunc(stopmachine_data.caller); + smp_mb(); while ( state != STOPMACHINE_EXIT ) @@ -XXX,XX +XXX,XX @@ static void cf_check stopmachine_action(void *data) } local_irq_enable(); + + arch_smp_post_callfunc(stopmachine_data.caller); } static int cf_check cpu_callback( diff --git a/xen/include/xen/smp.h b/xen/include/xen/smp.h index XXXXXXX..XXXXXXX 100644 --- a/xen/include/xen/smp.h +++ b/xen/include/xen/smp.h @@ -XXX,XX +XXX,XX @@ extern void *stack_base[NR_CPUS]; void initialize_cpu_data(unsigned int cpu); int setup_cpu_root_pgt(unsigned int cpu); +#ifdef HAS_ARCH_SMP_CALLFUNC_PREAMBLE +void arch_smp_pre_callfunc(unsigned int cpu); +void arch_smp_post_callfunc(unsigned int cpu); +#else +static inline void arch_smp_pre_callfunc(unsigned int cpu) {} +static inline void arch_smp_post_callfunc(unsigned int cpu) {} +#endif + #endif /* __XEN_SMP_H__ */ -- 2.46.0
With the stack mapped on a per-CPU basis there's no risk of other CPUs being able to read the stack contents, but vCPUs running on the current pCPU could read stack rubble from operations of previous vCPUs. The #DF stack is not zeroed because handling of #DF results in a panic. The contents of the shadow stack are not cleared as part of this change. It's arguable that leaking internal Xen return addresses is not guest confidential data. At most those could be used by an attacker to figure out the paths inside of Xen that previous execution flows have used. Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- Is it required to zero the stack when doing a non-lazy context switch from the idle vCPU to the previously running vCPU? d0v0 -> IDLE -> sync_execstate -> zero stack? -> d0v0 This is currently done in this proposal, as when running in the idle vCPU context (iow: not lazily switched) stacks from remote pCPUs can be mapped or tasklets executed. --- Changes since v1: - Zero the stack forward to use ERMS. - Only zero the IST stacks if they have been used. - Only zero the primary stack for full context switches. --- docs/misc/xen-command-line.pandoc | 4 +- xen/arch/x86/cpu/mcheck/mce.c | 4 ++ xen/arch/x86/domain.c | 13 ++++++- xen/arch/x86/include/asm/current.h | 53 +++++++++++++++++++++++--- xen/arch/x86/include/asm/domain.h | 3 ++ xen/arch/x86/include/asm/spec_ctrl.h | 1 + xen/arch/x86/spec_ctrl.c | 57 ++++++++++++++++++++++++---- xen/arch/x86/traps.c | 5 +++ 8 files changed, 124 insertions(+), 16 deletions(-) diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc index XXXXXXX..XXXXXXX 100644 --- a/docs/misc/xen-command-line.pandoc +++ b/docs/misc/xen-command-line.pandoc @@ -XXX,XX +XXX,XX @@ to appropriate auditing by Xen. Argo is disabled by default. ### asi (x86) > `= List of [ <bool>, {pv,hvm}=<bool>, - {vcpu-pt,cpu-stack}=<bool>|{pv,hvm}=<bool> ]` + {vcpu-pt,cpu-stack,zero-stack}=<bool>|{pv,hvm}=<bool> ]` Offers control over whether the hypervisor will engage in Address Space Isolation, by not having potentially sensitive information permanently mapped @@ -XXX,XX +XXX,XX @@ meant to be used for debugging purposes only.** * `cpu-stack` prevent CPUs from having permanent mappings of stacks different than their own. Depends on the `vcpu-pt` option. +* `zero-stack` zero CPU stacks when context switching vCPUs. + ### asid (x86) > `= <boolean>` diff --git a/xen/arch/x86/cpu/mcheck/mce.c b/xen/arch/x86/cpu/mcheck/mce.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/cpu/mcheck/mce.c +++ b/xen/arch/x86/cpu/mcheck/mce.c @@ -XXX,XX +XXX,XX @@ struct mce_callbacks __ro_after_init mce_callbacks = { static const typeof(mce_callbacks.handler) __initconst_cf_clobber __used default_handler = unexpected_machine_check; +DEFINE_PER_CPU(unsigned int, slice_mce_count); + /* Call the installed machine check handler for this CPU setup.
*/ void do_machine_check(const struct cpu_user_regs *regs) { + this_cpu(slice_mce_count)++; + mce_enter(); alternative_vcall(mce_callbacks.handler, regs); mce_exit(); diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -XXX,XX +XXX,XX @@ void context_switch(struct vcpu *prev, struct vcpu *next) struct cpu_info *info = get_cpu_info(); const struct domain *prevd = prev->domain, *nextd = next->domain; unsigned int dirty_cpu = read_atomic(&next->dirty_cpu); + bool lazy = false; ASSERT(prev != next); ASSERT(local_irq_is_enabled()); @@ -XXX,XX +XXX,XX @@ void context_switch(struct vcpu *prev, struct vcpu *next) */ set_current(next); local_irq_enable(); + lazy = true; } else { @@ -XXX,XX +XXX,XX @@ void context_switch(struct vcpu *prev, struct vcpu *next) /* Ensure that the vcpu has an up-to-date time base. */ update_vcpu_system_time(next); - reset_stack_and_call_ind(nextd->arch.ctxt_switch->tail); + /* + * Context switches to the idle vCPU (either lazy or full) will never + * trigger zeroing of the stack, because the idle domain doesn't have ASI + * enabled. Switching back to the previously running vCPU after a lazy + * switch shouldn't zero the stack either. + */ + reset_stack_and_call_ind(nextd->arch.ctxt_switch->tail, + !lazy && nextd->arch.zero_stack); } void continue_running(struct vcpu *same) { - reset_stack_and_call_ind(same->domain->arch.ctxt_switch->tail); + reset_stack_and_call_ind(same->domain->arch.ctxt_switch->tail, false); } int __sync_local_execstate(void) diff --git a/xen/arch/x86/include/asm/current.h b/xen/arch/x86/include/asm/current.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/current.h +++ b/xen/arch/x86/include/asm/current.h @@ -XXX,XX +XXX,XX @@ unsigned long get_stack_dump_bottom (unsigned long sp); # define SHADOW_STACK_WORK "" #endif +#define ZERO_STACK \ + "test %[stk_size], %[stk_size];" \ + "jz .L_skip_zeroing.%=;" \ + "rep stosb;" \ + ".L_skip_zeroing.%=:" + #if __GNUC__ >= 9 # define ssaj_has_attr_noreturn(fn) __builtin_has_attribute(fn, __noreturn__) #else @@ -XXX,XX +XXX,XX @@ unsigned long get_stack_dump_bottom (unsigned long sp); # define ssaj_has_attr_noreturn(fn) true #endif -#define switch_stack_and_jump(fn, instr, constr) \ +DECLARE_PER_CPU(unsigned int, slice_mce_count); +DECLARE_PER_CPU(unsigned int, slice_nmi_count); +DECLARE_PER_CPU(unsigned int, slice_db_count); + +#define switch_stack_and_jump(fn, instr, constr, zero_stk) \ ({ \ unsigned int tmp; \ + \ BUILD_BUG_ON(!ssaj_has_attr_noreturn(fn)); \ + ASSERT(IS_ALIGNED((unsigned long)guest_cpu_user_regs() - \ + PRIMARY_STACK_SIZE + \ + sizeof(struct cpu_info), PAGE_SIZE)); \ + if ( zero_stk ) \ + { \ + unsigned long stack_top = get_stack_bottom() & \ + ~(STACK_SIZE - 1); \ + \ + if ( this_cpu(slice_mce_count) ) \ + { \ + this_cpu(slice_mce_count) = 0; \ + clear_page((void *)stack_top + IST_MCE * PAGE_SIZE); \ + } \ + if ( this_cpu(slice_nmi_count) ) \ + { \ + this_cpu(slice_nmi_count) = 0; \ + clear_page((void *)stack_top + IST_NMI * PAGE_SIZE); \ + } \ + if ( this_cpu(slice_db_count) ) \ + { \ + this_cpu(slice_db_count) = 0; \ + clear_page((void *)stack_top + IST_DB * PAGE_SIZE); \ + } \ + } \ __asm__ __volatile__ ( \ SHADOW_STACK_WORK \ "mov %[stk], %%rsp;" \ + ZERO_STACK \ CHECK_FOR_LIVEPATCH_WORK \ instr "[fun]" \ : [val] "=&r" (tmp), \ @@ -XXX,XX +XXX,XX @@ unsigned long get_stack_dump_bottom (unsigned long sp); ((PRIMARY_SHSTK_SLOT + 1) * PAGE_SIZE - 8), \ [stack_mask] "i" (STACK_SIZE - 
1), \ _ASM_BUGFRAME_INFO(BUGFRAME_bug, __LINE__, \ - __FILE__, NULL) \ + __FILE__, NULL), \ + /* For stack zeroing. */ \ + "D" ((void *)guest_cpu_user_regs() - \ + PRIMARY_STACK_SIZE + sizeof(struct cpu_info)), \ + [stk_size] "c" \ + ((zero_stk) ? PRIMARY_STACK_SIZE - sizeof(struct cpu_info)\ + : 0), \ + "a" (0) \ : "memory" ); \ unreachable(); \ }) #define reset_stack_and_jump(fn) \ - switch_stack_and_jump(fn, "jmp %c", "i") + switch_stack_and_jump(fn, "jmp %c", "i", false) /* The constraint may only specify non-call-clobbered registers. */ -#define reset_stack_and_call_ind(fn) \ +#define reset_stack_and_call_ind(fn, zero_stk) \ ({ \ (void)((fn) == (void (*)(void))NULL); \ - switch_stack_and_jump(fn, "INDIRECT_CALL %", "b"); \ + switch_stack_and_jump(fn, "INDIRECT_CALL %", "b", zero_stk); \ }) /* diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/domain.h +++ b/xen/arch/x86/include/asm/domain.h @@ -XXX,XX +XXX,XX @@ struct arch_domain /* Use per-CPU mapped stacks. */ bool cpu_stack; + /* Zero CPU stack on non lazy context switch. */ + bool zero_stack; + /* Emulated devices enabled bitmap. */ uint32_t emulation_flags; } __cacheline_aligned; diff --git a/xen/arch/x86/include/asm/spec_ctrl.h b/xen/arch/x86/include/asm/spec_ctrl.h index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/include/asm/spec_ctrl.h +++ b/xen/arch/x86/include/asm/spec_ctrl.h @@ -XXX,XX +XXX,XX @@ extern int8_t opt_xpti_hwdom, opt_xpti_domu; extern int8_t opt_vcpu_pt_pv, opt_vcpu_pt_hwdom, opt_vcpu_pt_hvm; extern int8_t opt_cpu_stack_pv, opt_cpu_stack_hwdom, opt_cpu_stack_hvm; +extern int8_t opt_zero_stack_pv, opt_zero_stack_hwdom, opt_zero_stack_hvm; extern bool cpu_has_bug_l1tf; extern int8_t opt_pv_l1tf_hwdom, opt_pv_l1tf_domu; diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/spec_ctrl.c +++ b/xen/arch/x86/spec_ctrl.c @@ -XXX,XX +XXX,XX @@ int8_t __ro_after_init opt_vcpu_pt_pv = -1; int8_t __ro_after_init opt_cpu_stack_hvm = -1; int8_t __ro_after_init opt_cpu_stack_hwdom = -1; int8_t __ro_after_init opt_cpu_stack_pv = -1; +/* Zero CPU stacks. 
*/ +int8_t __ro_after_init opt_zero_stack_hvm = -1; +int8_t __ro_after_init opt_zero_stack_hwdom = -1; +int8_t __ro_after_init opt_zero_stack_pv = -1; static int __init cf_check parse_spec_ctrl(const char *s) { @@ -XXX,XX +XXX,XX @@ static int __init cf_check parse_asi(const char *s) { opt_vcpu_pt_pv = opt_vcpu_pt_hwdom = opt_vcpu_pt_hvm = 1; opt_cpu_stack_pv = opt_cpu_stack_hwdom = opt_cpu_stack_hvm = 1; + opt_zero_stack_pv = opt_zero_stack_hvm = opt_zero_stack_hwdom = 1; } do { @@ -XXX,XX +XXX,XX @@ static int __init cf_check parse_asi(const char *s) case 1: opt_vcpu_pt_pv = opt_vcpu_pt_hwdom = opt_vcpu_pt_hvm = val; opt_cpu_stack_pv = opt_cpu_stack_hvm = opt_cpu_stack_hwdom = val; + opt_zero_stack_pv = opt_zero_stack_hvm = opt_zero_stack_hwdom = val; break; default: if ( (val = parse_boolean("pv", s, ss)) >= 0 ) - opt_cpu_stack_pv = opt_vcpu_pt_pv = val; + opt_zero_stack_pv = opt_cpu_stack_pv = opt_vcpu_pt_pv = val; else if ( (val = parse_boolean("hvm", s, ss)) >= 0 ) - opt_cpu_stack_hvm = opt_vcpu_pt_hvm = val; + opt_zero_stack_hvm = opt_cpu_stack_hvm = opt_vcpu_pt_hvm = val; else if ( (val = parse_boolean("vcpu-pt", s, ss)) != -1 ) { switch ( val ) @@ -XXX,XX +XXX,XX @@ static int __init cf_check parse_asi(const char *s) break; } } + else if ( (val = parse_boolean("zero-stack", s, ss)) != -1 ) + { + switch ( val ) + { + case 1: + case 0: + opt_zero_stack_pv = opt_zero_stack_hvm = + opt_zero_stack_hwdom = val; + break; + + case -2: + s += strlen("zero-stack="); + if ( (val = parse_boolean("pv", s, ss)) >= 0 ) + opt_zero_stack_pv = val; + else if ( (val = parse_boolean("hvm", s, ss)) >= 0 ) + opt_zero_stack_hvm = val; + else + default: + rc = -EINVAL; + break; + } + } else if ( *s ) rc = -EINVAL; break; @@ -XXX,XX +XXX,XX @@ static void __init print_details(enum ind_thunk thunk) #endif #ifdef CONFIG_HVM - printk(" ASI features for HVM VMs:%s%s%s\n", - opt_vcpu_pt_hvm || opt_cpu_stack_hvm ? "" : " None", + printk(" ASI features for HVM VMs:%s%s%s%s\n", + opt_vcpu_pt_hvm || opt_cpu_stack_hvm || + opt_zero_stack_hvm ? "" : " None", opt_vcpu_pt_hvm ? " vCPU-PT" : "", - opt_cpu_stack_hvm ? " CPU-STACK" : ""); + opt_cpu_stack_hvm ? " CPU-STACK" : "", + opt_zero_stack_hvm ? " ZERO-STACK" : ""); #endif #ifdef CONFIG_PV - printk(" ASI features for PV VMs:%s%s%s\n", - opt_vcpu_pt_pv || opt_cpu_stack_pv ? "" : " None", + printk(" ASI features for PV VMs:%s%s%s%s\n", + opt_vcpu_pt_pv || opt_cpu_stack_pv || + opt_zero_stack_pv ? "" : " None", opt_vcpu_pt_pv ? " vCPU-PT" : "", - opt_cpu_stack_pv ? " CPU-STACK" : ""); + opt_cpu_stack_pv ? " CPU-STACK" : "", + opt_zero_stack_pv ? " ZERO-STACK" : ""); #endif } @@ -XXX,XX +XXX,XX @@ void spec_ctrl_init_domain(struct domain *d) d->arch.cpu_stack = is_hardware_domain(d) ? opt_cpu_stack_hwdom : pv ? opt_cpu_stack_pv : opt_cpu_stack_hvm; + d->arch.zero_stack = is_hardware_domain(d) ? opt_zero_stack_hwdom + : pv ? 
opt_zero_stack_pv + : opt_zero_stack_hvm; } void __init init_speculation_mitigations(void) @@ -XXX,XX +XXX,XX @@ void __init init_speculation_mitigations(void) opt_cpu_stack_hwdom = 0; if ( opt_cpu_stack_hvm == -1 ) opt_cpu_stack_hvm = 0; + if ( opt_zero_stack_pv == -1 ) + opt_zero_stack_pv = 0; + if ( opt_zero_stack_hwdom == -1 ) + opt_zero_stack_hwdom = 0; + if ( opt_zero_stack_hvm == -1 ) + opt_zero_stack_hvm = 0; if ( opt_vcpu_pt_pv || opt_vcpu_pt_hvm ) warning_add( diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c index XXXXXXX..XXXXXXX 100644 --- a/xen/arch/x86/traps.c +++ b/xen/arch/x86/traps.c @@ -XXX,XX +XXX,XX @@ static void unknown_nmi_error(const struct cpu_user_regs *regs, static nmi_callback_t *__read_mostly nmi_callback; DEFINE_PER_CPU(unsigned int, nmi_count); +DEFINE_PER_CPU(unsigned int, slice_nmi_count); void do_nmi(const struct cpu_user_regs *regs) { @@ -XXX,XX +XXX,XX @@ void do_nmi(const struct cpu_user_regs *regs) bool handle_unknown = false; this_cpu(nmi_count)++; + this_cpu(slice_nmi_count)++; nmi_enter(); /* @@ -XXX,XX +XXX,XX @@ void asmlinkage do_device_not_available(struct cpu_user_regs *regs) void nocall sysenter_eflags_saved(void); +DEFINE_PER_CPU(unsigned int, slice_db_count); + void asmlinkage do_debug(struct cpu_user_regs *regs) { unsigned long dr6; @@ -XXX,XX +XXX,XX @@ void asmlinkage do_debug(struct cpu_user_regs *regs) /* Stash dr6 as early as possible. */ dr6 = read_debugreg(6); + this_cpu(slice_db_count)++; /* * At the time of writing (March 2018), on the subject of %dr6: * -- 2.46.0
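As a summary of the zeroing policy in this patch, here is a rough, self-contained model: the primary stack is scrubbed only on a full (non-lazy) switch into a domain with zero-stack enabled, the MCE/NMI/#DB IST stacks only if their per-CPU slice counters show they were entered since the last scrub, and the #DF stack never. The sizes, the plain struct domain, and memset() below are simplified stand-ins for PRIMARY_STACK_SIZE, d->arch.zero_stack, clear_page() and the ERMS rep stosb path; this is an illustration of the decision logic, not the actual switch_stack_and_jump() code.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE   4096u
#define IST_PAGES   3u                    /* MCE, NMI, #DB; the #DF stack is never zeroed */
#define PRIMARY_SZ  (3u * PAGE_SIZE)      /* stand-in for PRIMARY_STACK_SIZE */

/* Toy per-CPU state. */
static unsigned char ist_stack[IST_PAGES][PAGE_SIZE];
static unsigned char primary_stack[PRIMARY_SZ];
static unsigned int slice_count[IST_PAGES];  /* slice_{mce,nmi,db}_count stand-ins */

struct domain {
    bool zero_stack;                      /* stand-in for d->arch.zero_stack */
};

/* Models the zero_stk argument of reset_stack_and_call_ind(): only a full
 * (non-lazy) switch into a domain with stack zeroing enabled scrubs anything. */
static void switch_stack(const struct domain *next, bool lazy)
{
    bool zero_stk = !lazy && next->zero_stack;
    unsigned int i;

    if ( !zero_stk )
        return;

    /* IST stacks: only scrub the ones that were actually entered. */
    for ( i = 0; i < IST_PAGES; i++ )
        if ( slice_count[i] )
        {
            slice_count[i] = 0;
            memset(ist_stack[i], 0, PAGE_SIZE);
        }

    /* Primary stack: zeroed forward (rep stosb in the real code). */
    memset(primary_stack, 0, PRIMARY_SZ);
}

int main(void)
{
    struct domain d = { .zero_stack = true };

    primary_stack[42] = 0xaa;             /* pretend there's stack rubble */
    slice_count[1] = 1;                   /* pretend an NMI was handled   */

    switch_stack(&d, /* lazy */ false);
    printf("rubble after full switch: %#x\n", (unsigned int)primary_stack[42]);
    return 0;
}

Note that, as the asm input constraints in the current.h hunk show, the real zeroing starts just past the struct cpu_info and covers PRIMARY_STACK_SIZE minus that header, so the live cpu_info block is preserved across the switch; the model above ignores that detail.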