-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
Xen Security Advisory CVE-2022-26362 / XSA-401
version 2
x86 pv: Race condition in typeref acquisition
UPDATES IN VERSION 2
====================
Update 4.16 and 4.15 baselines.
Public release.
ISSUE DESCRIPTION
=================
Xen maintains a type reference count for pages, in addition to a regular
reference count. This scheme is used to maintain invariants required
for Xen's safety, e.g. PV guests may not have direct writeable access to
pagetables; updates need auditing by Xen.
Unfortunately, the logic for acquiring a type reference has a race
condition, whereby a safety TLB flush is issued too early and creates a
window where the guest can re-establish the read/write mapping before
writeability is prohibited.
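For illustration only, here is a minimal, self-contained C sketch of the
ordering problem and of the fix. This is not the Xen code: the names
(type_info, flush_other_cpus, get_type_racy, get_type_fixed) and the bit
layout are invented for this example, and real typeref acquisition handles
many more cases. The actual logic lives in _get_page_type() in
xen/arch/x86/mm.c and is corrected by the attached patches.

#include <stdatomic.h>
#include <stdbool.h>

#define TYPE_MASK  0xf0u                 /* stand-in for PGT_type_mask  */
#define COUNT_MASK 0x0fu                 /* stand-in for PGT_count_mask */

/* Stand-in for page->u.inuse.type_info. */
static _Atomic unsigned int type_info;

/* Stand-in for the safety TLB flush (flush_tlb_mask() in Xen). */
static void flush_other_cpus(void) { }

/* Vulnerable ordering: the flush is issued before the typeref is taken. */
static bool get_type_racy(unsigned int type)
{
    unsigned int x  = atomic_load(&type_info);
    unsigned int nx = type | 1;          /* new type, count 0 -> 1 */

    if ( (x & COUNT_MASK) == 0 && (x & TYPE_MASK) != type )
        flush_other_cpus();   /* too early: in this window another CPU can
                               * still take a writable typeref, write through
                               * it (caching a TLB entry), and drop it ...
                               */

    /* ... so the stale writable TLB entry survives a successful cmpxchg. */
    return atomic_compare_exchange_strong(&type_info, &x, nx);
}

/* Fixed ordering: take the typeref first, then perform the safety flush. */
static bool get_type_fixed(unsigned int type)
{
    unsigned int x  = atomic_load(&type_info);
    unsigned int nx = type | 1;

    if ( !atomic_compare_exchange_strong(&type_info, &x, nx) )
        return false;         /* caller re-reads type_info and retries */

    if ( (x & COUNT_MASK) == 0 && (x & TYPE_MASK) != type )
        flush_other_cpus();   /* safe: no new writable mapping can be created
                               * once the cmpxchg has published the new type
                               */
    return true;
}

The second patch in each attached series ("x86/pv: Fix ABAC cmpxchg() race in
_get_page_type()") makes the analogous change in Xen, deferring all safety
actions until after the cmpxchg() has taken the intended typeref.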
IMPACT
======
Malicious x86 PV guest administrators may be able to escalate privilege
so as to control the whole system.
VULNERABLE SYSTEMS
==================
All versions of Xen are vulnerable.
Only x86 PV guests can trigger this vulnerability.
To exploit the vulnerability, there needs to be an undue delay at just
the wrong moment in _get_page_type(). The degree to which an x86 PV
guest can practically control this race condition is unknown.
MITIGATION
==========
Not running x86 PV guests will avoid the vulnerability.
CREDITS
=======
This issue was discovered by Jann Horn of Google Project Zero.
RESOLUTION
==========
Applying the appropriate attached patches resolves this issue.
Note that patches for released versions are generally prepared to
apply to the stable branches, and may not apply cleanly to the most
recent release tarball. Downstreams are encouraged to update to the
tip of the stable branch before applying these patches.
xsa401/xsa401-?.patch xen-unstable
xsa401/xsa401-4.16-?.patch Xen 4.16.x - Xen 4.14.x
xsa401/xsa401-4.13-?.patch Xen 4.13.x
$ sha256sum xsa401* xsa401*/*
d442bc0946eaa4c325226fd0805ab81eba6a68b68cffb9b03d9552edea86b118 xsa401.meta
074b57204f828cbd004c2d024b02a41af5d5bf3547d407af27249dca95eca13a xsa401/xsa401-1.patch
a095b39b203d501f9c9d4974638cd4d5e2d7a18daee7a7a61e2010dea477e212 xsa401/xsa401-2.patch
99af3efc91d2dbf4fd54cc9f454b87bd76edbc85abd1a20bdad0bd22acabf466 xsa401/xsa401-4.13-1.patch
bb997094052edbbbdd0dc9f3a0454508eb737556e2449ec6a0bc649deb921e4f xsa401/xsa401-4.13-2.patch
d336b31cb91466942e4fb8b44783bb2f0be4995076e70e0e78cdf992147cf72a xsa401/xsa401-4.16-1.patch
b380a76d67957b602ff3c9a3faaa4d9b6666422834d6ee3ab72432a6d07ddbc6 xsa401/xsa401-4.16-2.patch
$
DEPLOYMENT DURING EMBARGO
=========================
Deployment of the patches and/or mitigations described above (or
others which are substantially similar) is permitted during the
embargo, even on public-facing systems with untrusted guest users and
administrators.
But: Distribution of updated software is prohibited (except to other
members of the predisclosure list).
Predisclosure list members who wish to deploy significantly different
patches and/or mitigations, please contact the Xen Project Security
Team.
(Note: this during-embargo deployment notice is retained in
post-embargo publicly released Xen Project advisories, even though it
is then no longer applicable. This is to enable the community to have
oversight of the Xen Project Security Team's decisionmaking.)
For more information about permissible uses of embargoed information,
consult the Xen Project community's agreed Security Policy:
http://www.xenproject.org/security-policy.html
-----BEGIN PGP SIGNATURE-----
iQFABAEBCAAqFiEEI+MiLBRfRHX6gGCng/4UyVfoK9kFAmKh4lsMHHBncEB4ZW4u
b3JnAAoJEIP+FMlX6CvZcoAH/ijbKKkQet6frag9HVfDHZtcb6N7yIxMUioVOu9t
tNhg4LdJJnnrCqXmJdXygZTYwIZufQGQOxMR3b66+6MJyz0JIL7XExqnLJs6mDsO
GFcvsxoGLYSdsBTVtGQgLpEPxwgkblKUQuwokz3K3kdxcHJmJceZitvaDdrycw8M
kRZ22qHUbFWTSOKZNe5t9t0x/4xwdyM4dYElAmuN4Ej1cQhhXG/Gbl+acZexS+cz
TFEbIS5G/j6EgaCpBSP5XCoUn2LlyswRxBllGh0kpaLrJRH4CX3E/KHBSdPMkWoP
3HyQF3o+WYvpWUGXVaAREaR+WxlsAwmQJUxpO64O4Y4IUEY=
=UGgq
-----END PGP SIGNATURE-----
{
"XSA": 401,
"SupportedVersions": [
"master",
"4.16",
"4.15",
"4.14",
"4.13"
],
"Trees": [
"xen"
],
"Recipes": {
"4.13": {
"Recipes": {
"xen": {
"StableRef": "fe97133b5deef58bd1422f4d87821131c66b1d0e",
"Prereqs": [],
"Patches": [
"xsa401/xsa401-4.13-?.patch"
]
}
}
},
"4.14": {
"Recipes": {
"xen": {
"StableRef": "17848dfed47f52b479c4e7eb412671aec5757329",
"Prereqs": [],
"Patches": [
"xsa401/xsa401-4.16-?.patch"
]
}
}
},
"4.15": {
"Recipes": {
"xen": {
"StableRef": "64249afeb63cf7d70b4faf02e76df5eed82371f9",
"Prereqs": [],
"Patches": [
"xsa401/xsa401-4.16-?.patch"
]
}
}
},
"4.16": {
"Recipes": {
"xen": {
"StableRef": "8e11ec8fbf6f933f8854f4bc54226653316903f2",
"Prereqs": [],
"Patches": [
"xsa401/xsa401-4.16-?.patch"
]
}
}
},
"master": {
"Recipes": {
"xen": {
"StableRef": "49dd52fb1311dadab29f6634d0bc1f4c022c357a",
"Prereqs": [],
"Patches": [
"xsa401/xsa401-?.patch"
]
}
}
}
}
}
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Clean up _get_page_type()
Various fixes for clarity, ahead of making complicated changes.
* Split the overflow check out of the if/else chain for type handling, as
it's somewhat unrelated.
* Comment the main if/else chain to explain what is going on. Adjust one
ASSERT() and state the bit layout for validate-locked and partial states.
* Correct the comment about TLB flushing, as it's backwards. The problem
case is when writeable mappings are retained to a page becoming read-only,
as it allows the guest to bypass Xen's safety checks for updates.
* Reduce the scope of 'y'. It is an artefact of the cmpxchg loop and not
valid for use by subsequent logic. Switch to using ACCESS_ONCE() to treat
all reads as explicitly volatile. The only thing preventing the validated
wait-loop being infinite is the compiler barrier hidden in cpu_relax().
* Replace one page_get_owner(page) with the already-calculated 'd' in
scope.
No functional change.
This is part of XSA-401 / CVE-2022-26362.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 04d5ec705d8e..6434900aa767 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2935,16 +2935,17 @@ static int _put_page_type(struct page_info *page, unsigned int flags,
static int _get_page_type(struct page_info *page, unsigned long type,
bool preemptible)
{
- unsigned long nx, x, y = page->u.inuse.type_info;
+ unsigned long nx, x;
int rc = 0;
ASSERT(!(type & ~(PGT_type_mask | PGT_pae_xen_l2)));
ASSERT(!in_irq());
- for ( ; ; )
+ for ( unsigned long y = ACCESS_ONCE(page->u.inuse.type_info); ; )
{
x = y;
nx = x + 1;
+
if ( unlikely((nx & PGT_count_mask) == 0) )
{
gdprintk(XENLOG_WARNING,
@@ -2952,8 +2953,15 @@ static int _get_page_type(struct page_info *page, unsigned long type,
mfn_x(page_to_mfn(page)));
return -EINVAL;
}
- else if ( unlikely((x & PGT_count_mask) == 0) )
+
+ if ( unlikely((x & PGT_count_mask) == 0) )
{
+ /*
+ * Typeref 0 -> 1.
+ *
+ * Type changes are permitted when the typeref is 0. If the type
+ * actually changes, the page needs re-validating.
+ */
struct domain *d = page_get_owner(page);
if ( d && shadow_mode_enabled(d) )
@@ -2964,8 +2972,8 @@ static int _get_page_type(struct page_info *page, unsigned long type,
{
/*
* On type change we check to flush stale TLB entries. It is
- * vital that no other CPUs are left with mappings of a frame
- * which is about to become writeable to the guest.
+ * vital that no other CPUs are left with writeable mappings
+ * to a frame which is intending to become pgtable/segdesc.
*/
cpumask_t *mask = this_cpu(scratch_cpumask);
@@ -2977,7 +2985,7 @@ static int _get_page_type(struct page_info *page, unsigned long type,
if ( unlikely(!cpumask_empty(mask)) &&
/* Shadow mode: track only writable pages. */
- (!shadow_mode_enabled(page_get_owner(page)) ||
+ (!shadow_mode_enabled(d) ||
((nx & PGT_type_mask) == PGT_writable_page)) )
{
perfc_incr(need_flush_tlb_flush);
@@ -3008,7 +3016,14 @@ static int _get_page_type(struct page_info *page, unsigned long type,
}
else if ( unlikely((x & (PGT_type_mask|PGT_pae_xen_l2)) != type) )
{
- /* Don't log failure if it could be a recursive-mapping attempt. */
+ /*
+ * else, we're trying to take a new reference, of the wrong type.
+ *
+ * This (being able to prohibit use of the wrong type) is what the
+ * typeref system exists for, but skip printing the failure if it
+ * looks like a recursive mapping, as subsequent logic might
+ * ultimately permit the attempt.
+ */
if ( ((x & PGT_type_mask) == PGT_l2_page_table) &&
(type == PGT_l1_page_table) )
return -EINVAL;
@@ -3027,18 +3042,46 @@ static int _get_page_type(struct page_info *page, unsigned long type,
}
else if ( unlikely(!(x & PGT_validated)) )
{
+ /*
+ * else, the count is non-zero, and we're grabbing the right type;
+ * but the page hasn't been validated yet.
+ *
+ * The page is in one of two states (depending on PGT_partial),
+ * and should have exactly one reference.
+ */
+ ASSERT((x & (PGT_type_mask | PGT_count_mask)) == (type | 1));
+
if ( !(x & PGT_partial) )
{
- /* Someone else is updating validation of this page. Wait... */
+ /*
+ * The page has been left in the "validate locked" state
+ * (i.e. PGT_[type] | 1) which means that a concurrent caller
+ * of _get_page_type() is in the middle of validation.
+ *
+ * Spin waiting for the concurrent user to complete (partial
+ * or fully validated), then restart our attempt to acquire a
+ * type reference.
+ */
do {
if ( preemptible && hypercall_preempt_check() )
return -EINTR;
cpu_relax();
- } while ( (y = page->u.inuse.type_info) == x );
+ } while ( (y = ACCESS_ONCE(page->u.inuse.type_info)) == x );
continue;
}
- /* Type ref count was left at 1 when PGT_partial got set. */
- ASSERT((x & PGT_count_mask) == 1);
+
+ /*
+ * The page has been left in the "partial" state
+ * (i.e., PGT_[type] | PGT_partial | 1).
+ *
+ * Rather than bumping the type count, we need to try to grab the
+ * validation lock; if we succeed, we need to validate the page,
+ * then drop the general ref associated with the PGT_partial bit.
+ *
+ * We grab the validation lock by setting nx to (PGT_[type] | 1)
+ * (i.e., non-zero type count, neither PGT_validated nor
+ * PGT_partial set).
+ */
nx = x & ~PGT_partial;
}
@@ -3087,6 +3130,13 @@ static int _get_page_type(struct page_info *page, unsigned long type,
}
out:
+ /*
+ * Did we drop the PGT_partial bit when acquiring the typeref? If so,
+ * drop the general reference that went along with it.
+ *
+ * N.B. validate_page() may have re-set PGT_partial, not reflected in
+ * nx, but will have taken an extra ref when doing so.
+ */
if ( (x & PGT_partial) && !(nx & PGT_partial) )
put_page(page);
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Fix ABAC cmpxchg() race in _get_page_type()
_get_page_type() suffers from a race condition where it incorrectly assumes
that because 'x' was read and a subsequent cmpxchg() succeeds, the type
cannot have changed in-between. Consider:
CPU A:
1. Creates an L2e referencing pg
`-> _get_page_type(pg, PGT_l1_page_table), sees count 0, type PGT_writable_page
2. Issues flush_tlb_mask()
CPU B:
3. Creates a writeable mapping of pg
`-> _get_page_type(pg, PGT_writable_page), count increases to 1
4. Writes into new mapping, creating a TLB entry for pg
5. Removes the writeable mapping of pg
`-> _put_page_type(pg), count goes back down to 0
CPU A:
7. Issues cmpxchg(), setting count 1, type PGT_l1_page_table
CPU B now has a writeable mapping to pg, which Xen believes is a pagetable and
suitably protected (i.e. read-only). The TLB flush in step 2 must be deferred
until after the guest is prohibited from creating new writeable mappings,
which is after step 7.
Defer all safety actions until after the cmpxchg() has successfully taken the
intended typeref, because that is what prevents concurrent users from using
the old type.
Also remove the early validation for writeable and shared pages. This removes
race conditions where one half of a parallel mapping attempt can return
successfully before:
* The IOMMU pagetables are in sync with the new page type
* Writeable mappings to shared pages have been torn down
This is part of XSA-401 / CVE-2022-26362.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 6434900aa767..34bb9dddab8d 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2962,56 +2962,12 @@ static int _get_page_type(struct page_info *page, unsigned long type,
* Type changes are permitted when the typeref is 0. If the type
* actually changes, the page needs re-validating.
*/
- struct domain *d = page_get_owner(page);
-
- if ( d && shadow_mode_enabled(d) )
- shadow_prepare_page_type_change(d, page, type);
ASSERT(!(x & PGT_pae_xen_l2));
if ( (x & PGT_type_mask) != type )
{
- /*
- * On type change we check to flush stale TLB entries. It is
- * vital that no other CPUs are left with writeable mappings
- * to a frame which is intending to become pgtable/segdesc.
- */
- cpumask_t *mask = this_cpu(scratch_cpumask);
-
- BUG_ON(in_irq());
- cpumask_copy(mask, d->dirty_cpumask);
-
- /* Don't flush if the timestamp is old enough */
- tlbflush_filter(mask, page->tlbflush_timestamp);
-
- if ( unlikely(!cpumask_empty(mask)) &&
- /* Shadow mode: track only writable pages. */
- (!shadow_mode_enabled(d) ||
- ((nx & PGT_type_mask) == PGT_writable_page)) )
- {
- perfc_incr(need_flush_tlb_flush);
- /*
- * If page was a page table make sure the flush is
- * performed using an IPI in order to avoid changing the
- * type of a page table page under the feet of
- * spurious_page_fault().
- */
- flush_mask(mask,
- (x & PGT_type_mask) &&
- (x & PGT_type_mask) <= PGT_root_page_table
- ? FLUSH_TLB | FLUSH_NO_ASSIST
- : FLUSH_TLB);
- }
-
- /* We lose existing type and validity. */
nx &= ~(PGT_type_mask | PGT_validated);
nx |= type;
-
- /*
- * No special validation needed for writable pages.
- * Page tables and GDT/LDT need to be scanned for validity.
- */
- if ( type == PGT_writable_page || type == PGT_shared_page )
- nx |= PGT_validated;
}
}
else if ( unlikely((x & (PGT_type_mask|PGT_pae_xen_l2)) != type) )
@@ -3092,6 +3048,56 @@ static int _get_page_type(struct page_info *page, unsigned long type,
return -EINTR;
}
+ /*
+ * One typeref has been taken and is now globally visible.
+ *
+ * The page is either in the "validate locked" state (PGT_[type] | 1) or
+ * fully validated (PGT_[type] | PGT_validated | >0).
+ */
+
+ if ( unlikely((x & PGT_count_mask) == 0) )
+ {
+ struct domain *d = page_get_owner(page);
+
+ if ( d && shadow_mode_enabled(d) )
+ shadow_prepare_page_type_change(d, page, type);
+
+ if ( (x & PGT_type_mask) != type )
+ {
+ /*
+ * On type change we check to flush stale TLB entries. It is
+ * vital that no other CPUs are left with writeable mappings
+ * to a frame which is intending to become pgtable/segdesc.
+ */
+ cpumask_t *mask = this_cpu(scratch_cpumask);
+
+ BUG_ON(in_irq());
+ cpumask_copy(mask, d->dirty_cpumask);
+
+ /* Don't flush if the timestamp is old enough */
+ tlbflush_filter(mask, page->tlbflush_timestamp);
+
+ if ( unlikely(!cpumask_empty(mask)) &&
+ /* Shadow mode: track only writable pages. */
+ (!shadow_mode_enabled(d) ||
+ ((nx & PGT_type_mask) == PGT_writable_page)) )
+ {
+ perfc_incr(need_flush_tlb_flush);
+ /*
+ * If page was a page table make sure the flush is
+ * performed using an IPI in order to avoid changing the
+ * type of a page table page under the feet of
+ * spurious_page_fault().
+ */
+ flush_mask(mask,
+ (x & PGT_type_mask) &&
+ (x & PGT_type_mask) <= PGT_root_page_table
+ ? FLUSH_TLB | FLUSH_NO_ASSIST
+ : FLUSH_TLB);
+ }
+ }
+ }
+
if ( unlikely(((x & PGT_type_mask) == PGT_writable_page) !=
(type == PGT_writable_page)) )
{
@@ -3120,13 +3126,25 @@ static int _get_page_type(struct page_info *page, unsigned long type,
if ( unlikely(!(nx & PGT_validated)) )
{
- if ( !(x & PGT_partial) )
+ /*
+ * No special validation needed for writable or shared pages. Page
+ * tables and GDT/LDT need to have their contents audited.
+ *
+ * per validate_page(), non-atomic updates are fine here.
+ */
+ if ( type == PGT_writable_page || type == PGT_shared_page )
+ page->u.inuse.type_info |= PGT_validated;
+ else
{
- page->nr_validated_ptes = 0;
- page->partial_flags = 0;
- page->linear_pt_count = 0;
+ if ( !(x & PGT_partial) )
+ {
+ page->nr_validated_ptes = 0;
+ page->partial_flags = 0;
+ page->linear_pt_count = 0;
+ }
+
+ rc = validate_page(page, type, preemptible);
}
- rc = validate_page(page, type, preemptible);
}
out:
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Clean up _get_page_type()
Various fixes for clarity, ahead of making complicated changes.
* Split the overflow check out of the if/else chain for type handling, as
it's somewhat unrelated.
* Comment the main if/else chain to explain what is going on. Adjust one
ASSERT() and state the bit layout for validate-locked and partial states.
* Correct the comment about TLB flushing, as it's backwards. The problem
case is when writeable mappings are retained to a page becoming read-only,
as it allows the guest to bypass Xen's safety checks for updates.
* Reduce the scope of 'y'. It is an artefact of the cmpxchg loop and not
valid for use by subsequent logic. Switch to using ACCESS_ONCE() to treat
all reads as explicitly volatile. The only thing preventing the validated
wait-loop being infinite is the compiler barrier hidden in cpu_relax().
* Replace one page_get_owner(page) with the already-calculated 'd' in
scope.
No functional change.
This is part of XSA-401 / CVE-2022-26362.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index ad89bfb45fff..96738b027827 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2978,16 +2978,17 @@ static int _put_page_type(struct page_info *page, unsigned int flags,
static int _get_page_type(struct page_info *page, unsigned long type,
bool preemptible)
{
- unsigned long nx, x, y = page->u.inuse.type_info;
+ unsigned long nx, x;
int rc = 0;
ASSERT(!(type & ~(PGT_type_mask | PGT_pae_xen_l2)));
ASSERT(!in_irq());
- for ( ; ; )
+ for ( unsigned long y = ACCESS_ONCE(page->u.inuse.type_info); ; )
{
x = y;
nx = x + 1;
+
if ( unlikely((nx & PGT_count_mask) == 0) )
{
gdprintk(XENLOG_WARNING,
@@ -2995,8 +2996,15 @@ static int _get_page_type(struct page_info *page, unsigned long type,
mfn_x(page_to_mfn(page)));
return -EINVAL;
}
- else if ( unlikely((x & PGT_count_mask) == 0) )
+
+ if ( unlikely((x & PGT_count_mask) == 0) )
{
+ /*
+ * Typeref 0 -> 1.
+ *
+ * Type changes are permitted when the typeref is 0. If the type
+ * actually changes, the page needs re-validating.
+ */
struct domain *d = page_get_owner(page);
if ( d && shadow_mode_enabled(d) )
@@ -3007,8 +3015,8 @@ static int _get_page_type(struct page_info *page, unsigned long type,
{
/*
* On type change we check to flush stale TLB entries. It is
- * vital that no other CPUs are left with mappings of a frame
- * which is about to become writeable to the guest.
+ * vital that no other CPUs are left with writeable mappings
+ * to a frame which is intending to become pgtable/segdesc.
*/
cpumask_t *mask = this_cpu(scratch_cpumask);
@@ -3020,7 +3028,7 @@ static int _get_page_type(struct page_info *page, unsigned long type,
if ( unlikely(!cpumask_empty(mask)) &&
/* Shadow mode: track only writable pages. */
- (!shadow_mode_enabled(page_get_owner(page)) ||
+ (!shadow_mode_enabled(d) ||
((nx & PGT_type_mask) == PGT_writable_page)) )
{
perfc_incr(need_flush_tlb_flush);
@@ -3041,7 +3049,14 @@ static int _get_page_type(struct page_info *page, unsigned long type,
}
else if ( unlikely((x & (PGT_type_mask|PGT_pae_xen_l2)) != type) )
{
- /* Don't log failure if it could be a recursive-mapping attempt. */
+ /*
+ * else, we're trying to take a new reference, of the wrong type.
+ *
+ * This (being able to prohibit use of the wrong type) is what the
+ * typeref system exists for, but skip printing the failure if it
+ * looks like a recursive mapping, as subsequent logic might
+ * ultimately permit the attempt.
+ */
if ( ((x & PGT_type_mask) == PGT_l2_page_table) &&
(type == PGT_l1_page_table) )
return -EINVAL;
@@ -3060,18 +3075,46 @@ static int _get_page_type(struct page_info *page, unsigned long type,
}
else if ( unlikely(!(x & PGT_validated)) )
{
+ /*
+ * else, the count is non-zero, and we're grabbing the right type;
+ * but the page hasn't been validated yet.
+ *
+ * The page is in one of two states (depending on PGT_partial),
+ * and should have exactly one reference.
+ */
+ ASSERT((x & (PGT_type_mask | PGT_count_mask)) == (type | 1));
+
if ( !(x & PGT_partial) )
{
- /* Someone else is updating validation of this page. Wait... */
+ /*
+ * The page has been left in the "validate locked" state
+ * (i.e. PGT_[type] | 1) which means that a concurrent caller
+ * of _get_page_type() is in the middle of validation.
+ *
+ * Spin waiting for the concurrent user to complete (partial
+ * or fully validated), then restart our attempt to acquire a
+ * type reference.
+ */
do {
if ( preemptible && hypercall_preempt_check() )
return -EINTR;
cpu_relax();
- } while ( (y = page->u.inuse.type_info) == x );
+ } while ( (y = ACCESS_ONCE(page->u.inuse.type_info)) == x );
continue;
}
- /* Type ref count was left at 1 when PGT_partial got set. */
- ASSERT((x & PGT_count_mask) == 1);
+
+ /*
+ * The page has been left in the "partial" state
+ * (i.e., PGT_[type] | PGT_partial | 1).
+ *
+ * Rather than bumping the type count, we need to try to grab the
+ * validation lock; if we succeed, we need to validate the page,
+ * then drop the general ref associated with the PGT_partial bit.
+ *
+ * We grab the validation lock by setting nx to (PGT_[type] | 1)
+ * (i.e., non-zero type count, neither PGT_validated nor
+ * PGT_partial set).
+ */
nx = x & ~PGT_partial;
}
@@ -3116,6 +3159,13 @@ static int _get_page_type(struct page_info *page, unsigned long type,
}
out:
+ /*
+ * Did we drop the PGT_partial bit when acquiring the typeref? If so,
+ * drop the general reference that went along with it.
+ *
+ * N.B. validate_page() may have re-set PGT_partial, not reflected in
+ * nx, but will have taken an extra ref when doing so.
+ */
if ( (x & PGT_partial) && !(nx & PGT_partial) )
put_page(page);
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Fix ABAC cmpxchg() race in _get_page_type()
_get_page_type() suffers from a race condition where it incorrectly assumes
that because 'x' was read and a subsequent cmpxchg() succeeds, the type
cannot have changed in-between. Consider:
CPU A:
1. Creates an L2e referencing pg
`-> _get_page_type(pg, PGT_l1_page_table), sees count 0, type PGT_writable_page
2. Issues flush_tlb_mask()
CPU B:
3. Creates a writeable mapping of pg
`-> _get_page_type(pg, PGT_writable_page), count increases to 1
4. Writes into new mapping, creating a TLB entry for pg
5. Removes the writeable mapping of pg
`-> _put_page_type(pg), count goes back down to 0
CPU A:
7. Issues cmpxchg(), setting count 1, type PGT_l1_page_table
CPU B now has a writeable mapping to pg, which Xen believes is a pagetable and
suitably protected (i.e. read-only). The TLB flush in step 2 must be deferred
until after the guest is prohibited from creating new writeable mappings,
which is after step 7.
Defer all safety actions until after the cmpxchg() has successfully taken the
intended typeref, because that is what prevents concurrent users from using
the old type.
Also remove the early validation for writeable and shared pages. This removes
race conditions where one half of a parallel mapping attempt can return
successfully before:
* The IOMMU pagetables are in sync with the new page type
* Writeable mappings to shared pages have been torn down
This is part of XSA-401 / CVE-2022-26362.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 96738b027827..ee91c7fe5f69 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3005,46 +3005,12 @@ static int _get_page_type(struct page_info *page, unsigned long type,
* Type changes are permitted when the typeref is 0. If the type
* actually changes, the page needs re-validating.
*/
- struct domain *d = page_get_owner(page);
-
- if ( d && shadow_mode_enabled(d) )
- shadow_prepare_page_type_change(d, page, type);
ASSERT(!(x & PGT_pae_xen_l2));
if ( (x & PGT_type_mask) != type )
{
- /*
- * On type change we check to flush stale TLB entries. It is
- * vital that no other CPUs are left with writeable mappings
- * to a frame which is intending to become pgtable/segdesc.
- */
- cpumask_t *mask = this_cpu(scratch_cpumask);
-
- BUG_ON(in_irq());
- cpumask_copy(mask, d->dirty_cpumask);
-
- /* Don't flush if the timestamp is old enough */
- tlbflush_filter(mask, page->tlbflush_timestamp);
-
- if ( unlikely(!cpumask_empty(mask)) &&
- /* Shadow mode: track only writable pages. */
- (!shadow_mode_enabled(d) ||
- ((nx & PGT_type_mask) == PGT_writable_page)) )
- {
- perfc_incr(need_flush_tlb_flush);
- flush_tlb_mask(mask);
- }
-
- /* We lose existing type and validity. */
nx &= ~(PGT_type_mask | PGT_validated);
nx |= type;
-
- /*
- * No special validation needed for writable pages.
- * Page tables and GDT/LDT need to be scanned for validity.
- */
- if ( type == PGT_writable_page || type == PGT_shared_page )
- nx |= PGT_validated;
}
}
else if ( unlikely((x & (PGT_type_mask|PGT_pae_xen_l2)) != type) )
@@ -3125,6 +3091,46 @@ static int _get_page_type(struct page_info *page, unsigned long type,
return -EINTR;
}
+ /*
+ * One typeref has been taken and is now globally visible.
+ *
+ * The page is either in the "validate locked" state (PGT_[type] | 1) or
+ * fully validated (PGT_[type] | PGT_validated | >0).
+ */
+
+ if ( unlikely((x & PGT_count_mask) == 0) )
+ {
+ struct domain *d = page_get_owner(page);
+
+ if ( d && shadow_mode_enabled(d) )
+ shadow_prepare_page_type_change(d, page, type);
+
+ if ( (x & PGT_type_mask) != type )
+ {
+ /*
+ * On type change we check to flush stale TLB entries. It is
+ * vital that no other CPUs are left with writeable mappings
+ * to a frame which is intending to become pgtable/segdesc.
+ */
+ cpumask_t *mask = this_cpu(scratch_cpumask);
+
+ BUG_ON(in_irq());
+ cpumask_copy(mask, d->dirty_cpumask);
+
+ /* Don't flush if the timestamp is old enough */
+ tlbflush_filter(mask, page->tlbflush_timestamp);
+
+ if ( unlikely(!cpumask_empty(mask)) &&
+ /* Shadow mode: track only writable pages. */
+ (!shadow_mode_enabled(d) ||
+ ((nx & PGT_type_mask) == PGT_writable_page)) )
+ {
+ perfc_incr(need_flush_tlb_flush);
+ flush_tlb_mask(mask);
+ }
+ }
+ }
+
if ( unlikely((x & PGT_type_mask) != type) )
{
/* Special pages should not be accessible from devices. */
@@ -3149,13 +3155,25 @@ static int _get_page_type(struct page_info *page, unsigned long type,
if ( unlikely(!(nx & PGT_validated)) )
{
- if ( !(x & PGT_partial) )
+ /*
+ * No special validation needed for writable or shared pages. Page
+ * tables and GDT/LDT need to have their contents audited.
+ *
+ * per validate_page(), non-atomic updates are fine here.
+ */
+ if ( type == PGT_writable_page || type == PGT_shared_page )
+ page->u.inuse.type_info |= PGT_validated;
+ else
{
- page->nr_validated_ptes = 0;
- page->partial_flags = 0;
- page->linear_pt_count = 0;
+ if ( !(x & PGT_partial) )
+ {
+ page->nr_validated_ptes = 0;
+ page->partial_flags = 0;
+ page->linear_pt_count = 0;
+ }
+
+ rc = alloc_page_type(page, type, preemptible);
}
- rc = alloc_page_type(page, type, preemptible);
}
out:
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Clean up _get_page_type()
Various fixes for clarity, ahead of making complicated changes.
* Split the overflow check out of the if/else chain for type handling, as
it's somewhat unrelated.
* Comment the main if/else chain to explain what is going on. Adjust one
ASSERT() and state the bit layout for validate-locked and partial states.
* Correct the comment about TLB flushing, as it's backwards. The problem
case is when writeable mappings are retained to a page becoming read-only,
as it allows the guest to bypass Xen's safety checks for updates.
* Reduce the scope of 'y'. It is an artefact of the cmpxchg loop and not
valid for use by subsequent logic. Switch to using ACCESS_ONCE() to treat
all reads as explicitly volatile. The only thing preventing the validated
wait-loop being infinite is the compiler barrier hidden in cpu_relax().
* Replace one page_get_owner(page) with the already-calculated 'd' in
scope.
No functional change.
This is part of XSA-401 / CVE-2022-26362.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 796faca64103..ddd32f88c798 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2935,16 +2935,17 @@ static int _put_page_type(struct page_info *page, unsigned int flags,
static int _get_page_type(struct page_info *page, unsigned long type,
bool preemptible)
{
- unsigned long nx, x, y = page->u.inuse.type_info;
+ unsigned long nx, x;
int rc = 0;
ASSERT(!(type & ~(PGT_type_mask | PGT_pae_xen_l2)));
ASSERT(!in_irq());
- for ( ; ; )
+ for ( unsigned long y = ACCESS_ONCE(page->u.inuse.type_info); ; )
{
x = y;
nx = x + 1;
+
if ( unlikely((nx & PGT_count_mask) == 0) )
{
gdprintk(XENLOG_WARNING,
@@ -2952,8 +2953,15 @@ static int _get_page_type(struct page_info *page, unsigned long type,
mfn_x(page_to_mfn(page)));
return -EINVAL;
}
- else if ( unlikely((x & PGT_count_mask) == 0) )
+
+ if ( unlikely((x & PGT_count_mask) == 0) )
{
+ /*
+ * Typeref 0 -> 1.
+ *
+ * Type changes are permitted when the typeref is 0. If the type
+ * actually changes, the page needs re-validating.
+ */
struct domain *d = page_get_owner(page);
if ( d && shadow_mode_enabled(d) )
@@ -2964,8 +2972,8 @@ static int _get_page_type(struct page_info *page, unsigned long type,
{
/*
* On type change we check to flush stale TLB entries. It is
- * vital that no other CPUs are left with mappings of a frame
- * which is about to become writeable to the guest.
+ * vital that no other CPUs are left with writeable mappings
+ * to a frame which is intending to become pgtable/segdesc.
*/
cpumask_t *mask = this_cpu(scratch_cpumask);
@@ -2977,7 +2985,7 @@ static int _get_page_type(struct page_info *page, unsigned long type,
if ( unlikely(!cpumask_empty(mask)) &&
/* Shadow mode: track only writable pages. */
- (!shadow_mode_enabled(page_get_owner(page)) ||
+ (!shadow_mode_enabled(d) ||
((nx & PGT_type_mask) == PGT_writable_page)) )
{
perfc_incr(need_flush_tlb_flush);
@@ -3008,7 +3016,14 @@ static int _get_page_type(struct page_info *page, unsigned long type,
}
else if ( unlikely((x & (PGT_type_mask|PGT_pae_xen_l2)) != type) )
{
- /* Don't log failure if it could be a recursive-mapping attempt. */
+ /*
+ * else, we're trying to take a new reference, of the wrong type.
+ *
+ * This (being able to prohibit use of the wrong type) is what the
+ * typeref system exists for, but skip printing the failure if it
+ * looks like a recursive mapping, as subsequent logic might
+ * ultimately permit the attempt.
+ */
if ( ((x & PGT_type_mask) == PGT_l2_page_table) &&
(type == PGT_l1_page_table) )
return -EINVAL;
@@ -3027,18 +3042,46 @@ static int _get_page_type(struct page_info *page, unsigned long type,
}
else if ( unlikely(!(x & PGT_validated)) )
{
+ /*
+ * else, the count is non-zero, and we're grabbing the right type;
+ * but the page hasn't been validated yet.
+ *
+ * The page is in one of two states (depending on PGT_partial),
+ * and should have exactly one reference.
+ */
+ ASSERT((x & (PGT_type_mask | PGT_count_mask)) == (type | 1));
+
if ( !(x & PGT_partial) )
{
- /* Someone else is updating validation of this page. Wait... */
+ /*
+ * The page has been left in the "validate locked" state
+ * (i.e. PGT_[type] | 1) which means that a concurrent caller
+ * of _get_page_type() is in the middle of validation.
+ *
+ * Spin waiting for the concurrent user to complete (partial
+ * or fully validated), then restart our attempt to acquire a
+ * type reference.
+ */
do {
if ( preemptible && hypercall_preempt_check() )
return -EINTR;
cpu_relax();
- } while ( (y = page->u.inuse.type_info) == x );
+ } while ( (y = ACCESS_ONCE(page->u.inuse.type_info)) == x );
continue;
}
- /* Type ref count was left at 1 when PGT_partial got set. */
- ASSERT((x & PGT_count_mask) == 1);
+
+ /*
+ * The page has been left in the "partial" state
+ * (i.e., PGT_[type] | PGT_partial | 1).
+ *
+ * Rather than bumping the type count, we need to try to grab the
+ * validation lock; if we succeed, we need to validate the page,
+ * then drop the general ref associated with the PGT_partial bit.
+ *
+ * We grab the validation lock by setting nx to (PGT_[type] | 1)
+ * (i.e., non-zero type count, neither PGT_validated nor
+ * PGT_partial set).
+ */
nx = x & ~PGT_partial;
}
@@ -3087,6 +3130,13 @@ static int _get_page_type(struct page_info *page, unsigned long type,
}
out:
+ /*
+ * Did we drop the PGT_partial bit when acquiring the typeref? If so,
+ * drop the general reference that went along with it.
+ *
+ * N.B. validate_page() may have re-set PGT_partial, not reflected in
+ * nx, but will have taken an extra ref when doing so.
+ */
if ( (x & PGT_partial) && !(nx & PGT_partial) )
put_page(page);
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Fix ABAC cmpxchg() race in _get_page_type()
_get_page_type() suffers from a race condition where it incorrectly assumes
that because 'x' was read and a subsequent cmpxchg() succeeds, the type
cannot have changed in-between. Consider:
CPU A:
1. Creates an L2e referencing pg
`-> _get_page_type(pg, PGT_l1_page_table), sees count 0, type PGT_writable_page
2. Issues flush_tlb_mask()
CPU B:
3. Creates a writeable mapping of pg
`-> _get_page_type(pg, PGT_writable_page), count increases to 1
4. Writes into new mapping, creating a TLB entry for pg
5. Removes the writeable mapping of pg
`-> _put_page_type(pg), count goes back down to 0
CPU A:
7. Issues cmpxchg(), setting count 1, type PGT_l1_page_table
CPU B now has a writeable mapping to pg, which Xen believes is a pagetable and
suitably protected (i.e. read-only). The TLB flush in step 2 must be deferred
until after the guest is prohibited from creating new writeable mappings,
which is after step 7.
Defer all safety actions until after the cmpxchg() has successfully taken the
intended typeref, because that is what prevents concurrent users from using
the old type.
Also remove the early validation for writeable and shared pages. This removes
race conditions where one half of a parallel mapping attempt can return
successfully before:
* The IOMMU pagetables are in sync with the new page type
* Writeable mappings to shared pages have been torn down
This is part of XSA-401 / CVE-2022-26362.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index ddd32f88c798..1693b580b152 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2962,56 +2962,12 @@ static int _get_page_type(struct page_info *page, unsigned long type,
* Type changes are permitted when the typeref is 0. If the type
* actually changes, the page needs re-validating.
*/
- struct domain *d = page_get_owner(page);
-
- if ( d && shadow_mode_enabled(d) )
- shadow_prepare_page_type_change(d, page, type);
ASSERT(!(x & PGT_pae_xen_l2));
if ( (x & PGT_type_mask) != type )
{
- /*
- * On type change we check to flush stale TLB entries. It is
- * vital that no other CPUs are left with writeable mappings
- * to a frame which is intending to become pgtable/segdesc.
- */
- cpumask_t *mask = this_cpu(scratch_cpumask);
-
- BUG_ON(in_irq());
- cpumask_copy(mask, d->dirty_cpumask);
-
- /* Don't flush if the timestamp is old enough */
- tlbflush_filter(mask, page->tlbflush_timestamp);
-
- if ( unlikely(!cpumask_empty(mask)) &&
- /* Shadow mode: track only writable pages. */
- (!shadow_mode_enabled(d) ||
- ((nx & PGT_type_mask) == PGT_writable_page)) )
- {
- perfc_incr(need_flush_tlb_flush);
- /*
- * If page was a page table make sure the flush is
- * performed using an IPI in order to avoid changing the
- * type of a page table page under the feet of
- * spurious_page_fault().
- */
- flush_mask(mask,
- (x & PGT_type_mask) &&
- (x & PGT_type_mask) <= PGT_root_page_table
- ? FLUSH_TLB | FLUSH_FORCE_IPI
- : FLUSH_TLB);
- }
-
- /* We lose existing type and validity. */
nx &= ~(PGT_type_mask | PGT_validated);
nx |= type;
-
- /*
- * No special validation needed for writable pages.
- * Page tables and GDT/LDT need to be scanned for validity.
- */
- if ( type == PGT_writable_page || type == PGT_shared_page )
- nx |= PGT_validated;
}
}
else if ( unlikely((x & (PGT_type_mask|PGT_pae_xen_l2)) != type) )
@@ -3092,6 +3048,56 @@ static int _get_page_type(struct page_info *page, unsigned long type,
return -EINTR;
}
+ /*
+ * One typeref has been taken and is now globally visible.
+ *
+ * The page is either in the "validate locked" state (PGT_[type] | 1) or
+ * fully validated (PGT_[type] | PGT_validated | >0).
+ */
+
+ if ( unlikely((x & PGT_count_mask) == 0) )
+ {
+ struct domain *d = page_get_owner(page);
+
+ if ( d && shadow_mode_enabled(d) )
+ shadow_prepare_page_type_change(d, page, type);
+
+ if ( (x & PGT_type_mask) != type )
+ {
+ /*
+ * On type change we check to flush stale TLB entries. It is
+ * vital that no other CPUs are left with writeable mappings
+ * to a frame which is intending to become pgtable/segdesc.
+ */
+ cpumask_t *mask = this_cpu(scratch_cpumask);
+
+ BUG_ON(in_irq());
+ cpumask_copy(mask, d->dirty_cpumask);
+
+ /* Don't flush if the timestamp is old enough */
+ tlbflush_filter(mask, page->tlbflush_timestamp);
+
+ if ( unlikely(!cpumask_empty(mask)) &&
+ /* Shadow mode: track only writable pages. */
+ (!shadow_mode_enabled(d) ||
+ ((nx & PGT_type_mask) == PGT_writable_page)) )
+ {
+ perfc_incr(need_flush_tlb_flush);
+ /*
+ * If page was a page table make sure the flush is
+ * performed using an IPI in order to avoid changing the
+ * type of a page table page under the feet of
+ * spurious_page_fault().
+ */
+ flush_mask(mask,
+ (x & PGT_type_mask) &&
+ (x & PGT_type_mask) <= PGT_root_page_table
+ ? FLUSH_TLB | FLUSH_FORCE_IPI
+ : FLUSH_TLB);
+ }
+ }
+ }
+
if ( unlikely(((x & PGT_type_mask) == PGT_writable_page) !=
(type == PGT_writable_page)) )
{
@@ -3120,13 +3126,25 @@ static int _get_page_type(struct page_info *page, unsigned long type,
if ( unlikely(!(nx & PGT_validated)) )
{
- if ( !(x & PGT_partial) )
+ /*
+ * No special validation needed for writable or shared pages. Page
+ * tables and GDT/LDT need to have their contents audited.
+ *
+ * per validate_page(), non-atomic updates are fine here.
+ */
+ if ( type == PGT_writable_page || type == PGT_shared_page )
+ page->u.inuse.type_info |= PGT_validated;
+ else
{
- page->nr_validated_ptes = 0;
- page->partial_flags = 0;
- page->linear_pt_count = 0;
+ if ( !(x & PGT_partial) )
+ {
+ page->nr_validated_ptes = 0;
+ page->partial_flags = 0;
+ page->linear_pt_count = 0;
+ }
+
+ rc = validate_page(page, type, preemptible);
}
- rc = validate_page(page, type, preemptible);
}
out: