The original implementation has two issues: For one it doesn't preserve
non-canonical-ness of inputs in the range 0x8000000000000000 through
0x80007fffffffffff. Bogus guest pointers in that range would not cause a
(#GP) fault upon access, when they should.
And then there is an AMD-specific aspect, where only the low 48 bits of
an address are used for speculative execution; the architecturally
mandated #GP for non-canonical addresses would be raised at a later
execution stage. Therefore to prevent Xen controlled data to make it
into any of the caches in a guest controllable manner, we need to
additionally ensure that for non-canonical inputs bit 47 would be clear.
See the code comment for how addressing both is being achieved.
Fixes: 4dc181599142 ("x86/PV: harden guest memory accesses against speculative abuse")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
RFC: Two variants of part of the logic are being presented, both with
certain undesirable aspects: The first form is pretty large and
ugly (some improvement may be possible by introducing further
helper macros). The alternative form continues to use RCR, which
generally would be nice to do away with. Then again that's also
slightly smaller generated code.
RFC: The .irp/.irpc/.if constructs used in variant 1 will need checking
as to them working with old gas as well as Clang.
RFC: When the scratch registers aren't %r8...%r15, several unnecessary
REX prefixes are emitted, as users of the macro pass in 64-bit
registers. Similar to what's done to be able to use SETcc, we could
derive %e.. from %r.. to shrink code size some. (An alternative,
requiring to touch the use sites, would be to constrain the scratch
registers to AX...DI and pass in only the last two characters of
the names [e.g. "di", i.e. also without the leading %]. That would
make it straightforward to use both %r.. and %e.. at the same time.
RFC: If code size was of concern, then in variant 1 the XOR could in
principle be omitted.
--- a/xen/arch/x86/include/asm/asm-defns.h
+++ b/xen/arch/x86/include/asm/asm-defns.h
@@ -1,3 +1,5 @@
+#include <asm/page-bits.h>
+
#ifndef HAVE_AS_CLAC_STAC
.macro clac
.byte 0x0f, 0x01, 0xca
@@ -65,17 +67,55 @@
.macro guest_access_mask_ptr ptr:req, scratch1:req, scratch2:req
#if defined(CONFIG_SPECULATIVE_HARDEN_GUEST_ACCESS)
/*
- * Here we want
- *
- * ptr &= ~0ull >> (ptr < HYPERVISOR_VIRT_END);
- *
+ * Here we want to adjust \ptr such that
+ * - if it's within Xen range, it becomes non-canonical,
+ * - otherwise if it's (non-)canonical on input, it retains that property,
+ * - if the result is non-canonical, bit 47 is clear (to avoid
+ * potentially populating the cache with Xen data),
* but guaranteed without any conditional branches (hence in assembly).
+ *
+ * To achieve this we determine which bit to forcibly clear: Either bit 47
+ * (in case the address is below HYPERVISOR_VIRT_END) or bit 63. Further
+ * we determine whether for forcably set bit 63: In case we first cleared
+ * it, we'll merely restore the original address. In case we ended up
+ * clearing bit 47 (i.e. the address was either non-canonical or within Xen
+ * range), setting the bit will yield a guaranteed non-canonical address.
+ * If we didn't clear a bit, we also won't set one: The address was in the
+ * low half of address space in that case with bit 47 already clear. The
+ * address can thus be left unchanged, whether canonical or not.
*/
mov $(HYPERVISOR_VIRT_END - 1), \scratch1
- mov $~0, \scratch2
+ mov $(VADDR_BITS - 1), \scratch2
cmp \ptr, \scratch1
+ /*
+ * Not needed: The value we have in \scratch1 will be truncated to 6 bits,
+ * thus yielding the value we need.
+ mov $63, \scratch1
+ */
+ cmovnb \scratch2, \scratch1
+ xor \scratch2, \scratch2
+ btr \scratch1, \ptr
+ .if 1
+ .irpc r, "acdb"
+ .if \scratch2 == %r\r\(\(x))
+ setc %\r\(\(l))
+ .endif
+ .endr
+ .irp r, bp, si, di
+ .if \scratch2 == %r\r
+ setc %\r\(\(l))
+ .endif
+ .endr
+ .irp r, 8, 9, 10, 11, 12, 13, 14, 15
+ .if \scratch2 == %r\r
+ setc %r\r\(\(b))
+ .endif
+ .endr
+ shl $63, \scratch2
+ .else
rcr $1, \scratch2
- and \scratch2, \ptr
+ .endif
+ or \scratch2, \ptr
#elif defined(CONFIG_DEBUG) && defined(CONFIG_PV)
xor $~\@, \scratch1
xor $~\@, \scratch2
On Tue, Nov 05, 2024 at 02:56:42PM +0100, Jan Beulich wrote:
> The original implementation has two issues: For one it doesn't preserve
> non-canonical-ness of inputs in the range 0x8000000000000000 through
> 0x80007fffffffffff. Bogus guest pointers in that range would not cause a
> (#GP) fault upon access, when they should.
>
> And then there is an AMD-specific aspect, where only the low 48 bits of
> an address are used for speculative execution; the architecturally
> mandated #GP for non-canonical addresses would be raised at a later
> execution stage. Therefore to prevent Xen controlled data to make it
> into any of the caches in a guest controllable manner, we need to
> additionally ensure that for non-canonical inputs bit 47 would be clear.
>
> See the code comment for how addressing both is being achieved.
>
> Fixes: 4dc181599142 ("x86/PV: harden guest memory accesses against speculative abuse")
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> RFC: Two variants of part of the logic are being presented, both with
> certain undesirable aspects: The first form is pretty large and
> ugly (some improvement may be possible by introducing further
> helper macros). The alternative form continues to use RCR, which
> generally would be nice to do away with. Then again that's also
> slightly smaller generated code.
Oh, I assume that's why there's a hardcoded .if 1, I was wondering
about that. What's the specific issue with using rcr? And why is the
more complex set of macros that use setc plus a shl better?
Why not use cmovc:
mov $(1 << 63), \scratch1
cmovc \scratch1, \scratch2
AFAICT \scratch1 is not used past the btr instruction, and hence can
be used for the cmovc?
Thanks, Roger.
On 23.01.2025 12:54, Roger Pau Monné wrote:
> On Tue, Nov 05, 2024 at 02:56:42PM +0100, Jan Beulich wrote:
>> The original implementation has two issues: For one it doesn't preserve
>> non-canonical-ness of inputs in the range 0x8000000000000000 through
>> 0x80007fffffffffff. Bogus guest pointers in that range would not cause a
>> (#GP) fault upon access, when they should.
>>
>> And then there is an AMD-specific aspect, where only the low 48 bits of
>> an address are used for speculative execution; the architecturally
>> mandated #GP for non-canonical addresses would be raised at a later
>> execution stage. Therefore to prevent Xen controlled data to make it
>> into any of the caches in a guest controllable manner, we need to
>> additionally ensure that for non-canonical inputs bit 47 would be clear.
>>
>> See the code comment for how addressing both is being achieved.
>>
>> Fixes: 4dc181599142 ("x86/PV: harden guest memory accesses against speculative abuse")
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> RFC: Two variants of part of the logic are being presented, both with
>> certain undesirable aspects: The first form is pretty large and
>> ugly (some improvement may be possible by introducing further
>> helper macros). The alternative form continues to use RCR, which
>> generally would be nice to do away with. Then again that's also
>> slightly smaller generated code.
>
> Oh, I assume that's why there's a hardcoded .if 1, I was wondering
> about that. What's the specific issue with using rcr?
It's slower than SHL. Albeit - checking a few places - not as much as I
thought I remembered it would be.
> And why is the
> more complex set of macros that use setc plus a shl better?
They're slightly longer (beyond the source complexity), but (presumably)
a little faster.
> Why not use cmovc:
>
> mov $(1 << 63), \scratch1
> cmovc \scratch1, \scratch2
>
> AFAICT \scratch1 is not used past the btr instruction, and hence can
> be used for the cmovc?
Such an alternative was already considered back at the time:
https://lists.xen.org/archives/html/xen-devel/2021-02/msg01067.html.
Granted I was wrong there about needing a 3rd scratch register, but
the code size consideration remains - we have dozens of instances of
this macro in the resulting binary, after all. Yet ftaod, this isn't
to mean we can't re-consider. Given the above I'm inclined though to
go the RCR route.
Jan
On Thu, Jan 23, 2025 at 01:44:34PM +0100, Jan Beulich wrote:
> On 23.01.2025 12:54, Roger Pau Monné wrote:
> > On Tue, Nov 05, 2024 at 02:56:42PM +0100, Jan Beulich wrote:
> >> The original implementation has two issues: For one it doesn't preserve
> >> non-canonical-ness of inputs in the range 0x8000000000000000 through
> >> 0x80007fffffffffff. Bogus guest pointers in that range would not cause a
> >> (#GP) fault upon access, when they should.
> >>
> >> And then there is an AMD-specific aspect, where only the low 48 bits of
> >> an address are used for speculative execution; the architecturally
> >> mandated #GP for non-canonical addresses would be raised at a later
> >> execution stage. Therefore to prevent Xen controlled data to make it
> >> into any of the caches in a guest controllable manner, we need to
> >> additionally ensure that for non-canonical inputs bit 47 would be clear.
> >>
> >> See the code comment for how addressing both is being achieved.
> >>
> >> Fixes: 4dc181599142 ("x86/PV: harden guest memory accesses against speculative abuse")
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> RFC: Two variants of part of the logic are being presented, both with
> >> certain undesirable aspects: The first form is pretty large and
> >> ugly (some improvement may be possible by introducing further
> >> helper macros). The alternative form continues to use RCR, which
> >> generally would be nice to do away with. Then again that's also
> >> slightly smaller generated code.
> >
> > Oh, I assume that's why there's a hardcoded .if 1, I was wondering
> > about that. What's the specific issue with using rcr?
>
> It's slower than SHL. Albeit - checking a few places - not as much as I
> thought I remembered it would be.
>
> > And why is the
> > more complex set of macros that use setc plus a shl better?
>
> They're slightly longer (beyond the source complexity), but (presumably)
> a little faster.
>
> > Why not use cmovc:
> >
> > mov $(1 << 63), \scratch1
> > cmovc \scratch1, \scratch2
> >
> > AFAICT \scratch1 is not used past the btr instruction, and hence can
> > be used for the cmovc?
>
> Such an alternative was already considered back at the time:
> https://lists.xen.org/archives/html/xen-devel/2021-02/msg01067.html.
> Granted I was wrong there about needing a 3rd scratch register, but
> the code size consideration remains - we have dozens of instances of
> this macro in the resulting binary, after all. Yet ftaod, this isn't
> to mean we can't re-consider. Given the above I'm inclined though to
> go the RCR route.
RCR or CMOVC would either be fine by me. The SETC is IMO too complex,
and using it would need a clear performance justification compared to
RCR or CMOVC.
Thanks, Roger.
© 2016 - 2026 Red Hat, Inc.