The switch statement in guest_io_okay() is a very expensive way of
pre-initialising x with ~0, and performing a partial read into it.
However, the logic isn't correct either.
In a real TSS, the CPU always reads two bytes (like here), and any TSS limit
violation turns silently into no-access. But, in-limit accesses trigger #PF
as usual. AMD document this property explicitly, and while Intel don't (so
far as I can tell), they do behave consistently with AMD.
Switch from __copy_from_guest_offset() to __copy_from_guest_pv(), like
everything else in this file. This removes code generation setting up
copy_from_user_hvm() (in the likely path even), and safety LFENCEs from
evaluate_nospec().
Change the logic to raise #PF if __copy_from_guest_pv() fails, rather than
disallowing the IO port access. This brings the behaviour better in line with
normal x86.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
RFC: Should we make the boundary check be (port + bytes + 8)? That would be
more correct, but liable to break unsuspecting VMs. Maybe we should just
comment our way out of it.
This needs to be combined with Jan's "x86/PV: simplify (and thus correct)
guest accessor functions" to function completely correctly. From XTF testing:
This series on its own:
--- Xen Test Framework ---
Environment: PV 64bit (Long mode 4 levels)
Test pv-emul-cr2
Error: %cr2 expected 0x3000, got 0x2fff
Test result: ERROR
This series plus Jan's fix:
--- Xen Test Framework ---
Environment: PV 64bit (Long mode 4 levels)
Test pv-emul-cr2
Test result: SUCCESS
Interestingly, the test also does an `INW` without an output parameter
straddling that page boundary, and it does reliably get 0x3000 even without
the accessor fix.
---
xen/arch/x86/pv/emul-priv-op.c | 27 ++++++++++++---------------
1 file changed, 12 insertions(+), 15 deletions(-)
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index 978bd6c0775f..b5d184038fa3 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -169,29 +169,26 @@ static int guest_io_okay(unsigned int port, unsigned int bytes,
if ( (port + bytes) <= v->arch.pv.iobmp_limit )
{
- union { uint8_t bytes[2]; uint16_t mask; } x;
+ const void *__user addr = v->arch.pv.iobmp.p + (port >> 3);
+ uint16_t mask;
+ int rc;
- /*
- * Grab permission bytes from guest space. Inaccessible bytes are
- * read as 0xff (no access allowed).
- */
+ /* Grab permission bytes from guest space. */
if ( user_mode )
toggle_guest_pt(v);
- switch ( __copy_from_guest_offset(x.bytes, v->arch.pv.iobmp,
- port>>3, 2) )
- {
- default: x.bytes[0] = ~0;
- /* fallthrough */
- case 1: x.bytes[1] = ~0;
- /* fallthrough */
- case 0: break;
- }
+ rc = __copy_from_guest_pv(&mask, addr, 2);
if ( user_mode )
toggle_guest_pt(v);
- if ( (x.mask & (((1 << bytes) - 1) << (port & 7))) == 0 )
+ if ( rc )
+ {
+ x86_emul_pagefault(0, (unsigned long)addr + bytes - rc, ctxt);
+ return X86EMUL_EXCEPTION;
+ }
+
+ if ( (mask & (((1 << bytes) - 1) << (port & 7))) == 0 )
return X86EMUL_OKAY;
}
--
2.39.5
On 30.09.2024 18:18, Andrew Cooper wrote: > RFC: Should we make the boundary check be (port + bytes + 8)? That would be > more correct, but liable to break unsuspecting VMs. Maybe we should just > comment our way out of it. What would the "+ 8" be intended to express? (I take it you mean ... > --- a/xen/arch/x86/pv/emul-priv-op.c > +++ b/xen/arch/x86/pv/emul-priv-op.c > @@ -169,29 +169,26 @@ static intguest_io_okay(unsigned int port, unsigned int bytes, > > if ( (port + bytes) <= v->arch.pv.iobmp_limit ) ... this check, which looks correct to me as is. In particular with the "+ 8" there would appear to be no way to access ports at the very top of the 64k range anymore, as PHYSDEVOP_set_iobitmap handling caps nr_ports at 64k. IOW I think "commenting our way out of it" is the only possible approach.) With or without such a comment added Reviewed-by: Jan Beulich <jbeulich@suse.com> Jan
On 01/10/2024 8:36 am, Jan Beulich wrote: > On 30.09.2024 18:18, Andrew Cooper wrote: >> RFC: Should we make the boundary check be (port + bytes + 8)? That would be >> more correct, but liable to break unsuspecting VMs. Maybe we should just >> comment our way out of it. > What would the "+ 8" be intended to express? (I take it you mean ... > >> --- a/xen/arch/x86/pv/emul-priv-op.c >> +++ b/xen/arch/x86/pv/emul-priv-op.c >> @@ -169,29 +169,26 @@ static intguest_io_okay(unsigned int port, unsigned int bytes, >> >> if ( (port + bytes) <= v->arch.pv.iobmp_limit ) > ... this check, which looks correct to me as is. In particular with the > "+ 8" there would appear to be no way to access ports at the very top of > the 64k range anymore, as PHYSDEVOP_set_iobitmap handling caps nr_ports > at 64k. IOW I think "commenting our way out of it" is the only possible > approach.) This is actually a horrid corner case, different to what I was thinking yesterday. In a real TSS, the CPU doesn't read past the TR limit. Such accesses are turned into a permission failure. But in Xen, guest_io_okay() does read one byte past iobmp_limit. Previously, we'd consume what was there if it was accessible, or declare a permissions failure if it was inaccessible. Now, we'll throw #PF back to the kernel with %cr2 beyond the the limit. It's not OK for Xen to be reading past the given limit; it's not how real CPUs behave. Part of the problem is that, even since it's introduction in 013351bd7 (2006), the public interface was named nr_ports while Xen's internal field was called iobmp_limit. In terms of how it's used these days in PV guests: Linux points the bitmap into it's real TSS, and sets nr_ports to either 0 or 65536 depending on whether the userspace task is permitted to use IO ports or not. NetBSD doesn't seem to use PHYSDEVOP_set_iobitmap at all, and only uses PHYSDEVOP_set_iopl. MiniOS uses neither. I think Xen's internal variable wants renaming to not be a limit, and the comment for PHYSDEVOP_set_iobitmap wants extending to note that, like a real TSS, Xen might read one byte beyond the end of the bitmap. I think this wants doing in a separate patch. Thoughts? > With or without such a comment added > Reviewed-by: Jan Beulich <jbeulich@suse.com> Thanks. ~Andrew
On 01.10.2024 12:41, Andrew Cooper wrote: > I think Xen's internal variable wants renaming to not be a limit, and > the comment for PHYSDEVOP_set_iobitmap wants extending to note that, > like a real TSS, Xen might read one byte beyond the end of the bitmap. > > I think this wants doing in a separate patch. Thoughts? That's probably going to be best. Jan
© 2016 - 2024 Red Hat, Inc.