[PATCH] x86/PVH: extend checking in hwdom_fixup_p2m()
Posted by Jan Beulich 3 months, 3 weeks ago
We're generally striving to minimize behavioral differences between PV
and PVH Dom0. Using (just?) is_memory_hole() in the PVH case looks quite
a bit weaker to me, compared to the page ownership check done in the PV
case. Extend checking accordingly.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
The addition may actually be suitable to replace the use of
is_memory_hole() here. While dropping that would in particular extend
coverage to E820_RESERVED regions, those are identity-mapped anyway
(albeit oddly enough still by IOMMU code).

--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -184,6 +184,22 @@ static int hwdom_fixup_p2m(paddr_t addr)
          !is_memory_hole(_mfn(gfn), _mfn(gfn)) )
         return -EPERM;
 
+    /*
+     * Much like get_page_from_l1e() for PV Dom0 does, check that the page
+     * accessed is actually an MMIO one: Either its MFN is out of range, or
+     * it's owned by DOM_IO.
+     */
+    if ( mfn_valid(_mfn(gfn)) )
+    {
+        struct page_info *pg = mfn_to_page(_mfn(gfn));
+        const struct domain *owner = page_get_owner_and_reference(pg);
+
+        if ( owner )
+            put_page(pg);
+        if ( owner != dom_io )
+            return -EPERM;
+    }
+
     mfn = get_gfn(currd, gfn, &type);
     if ( !mfn_eq(mfn, INVALID_MFN) || !p2m_is_hole(type) )
         rc = mfn_eq(mfn, _mfn(gfn)) ? -EEXIST : -ENOTEMPTY;
Re: [PATCH] x86/PVH: extend checking in hwdom_fixup_p2m()
Posted by Roger Pau Monné 3 months, 2 weeks ago
On Mon, Jul 07, 2025 at 04:44:25PM +0200, Jan Beulich wrote:
> We're generally striving to minimize behavioral differences between PV
> and PVH Dom0. Using (just?) is_memory_hole() in the PVH case looks quite
> a bit weaker to me, compared to the page ownership check done in the PV
> case. Extend checking accordingly.

The PV code path is also used by non-privileged domains, while
pf-fixup is strictly limited to the hardware domain.  But I agree that
the page ownership check is possibly better, and more in line with the
PV counterpart.

> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

LGTM, I would just request that you also remove the is_memory_hole()
check at the same time.

> ---
> The addition may actually be suitable to replace the use of
> is_memory_hole() here. While dropping that would in particular extend
> coverage to E820_RESERVED regions, those are identity-mapped anyway
> (albeit oddly enough still by IOMMU code).

You could avoid getting E820_RESERVED regions identity-mapped by
specifying dom0-iommu=map-reserved=0 on the command line, at which
point your suggestion to use the page ownership check seems better,
because it would also allow pf-fixup to deal with E820_RESERVED
regions.

I think it would be better to remove the is_memory_hole() check if we
introduce the page ownership checking.  Otherwise it's kind of
redundant and possibly confusing.
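
For illustration, a minimal sketch of what that would amount to: the
ownership check from the patch standing on its own, with the
!is_memory_hole() clause dropped from the preceding condition (untested,
and only meant to show the shape of the resulting check):

    /*
     * Treat only frames with an out-of-range MFN, or whose page is owned
     * by DOM_IO, as MMIO; reject everything else.
     */
    if ( mfn_valid(_mfn(gfn)) )
    {
        struct page_info *pg = mfn_to_page(_mfn(gfn));
        const struct domain *owner = page_get_owner_and_reference(pg);

        if ( owner )
            put_page(pg);
        if ( owner != dom_io )
            return -EPERM;
    }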

> 
> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -184,6 +184,22 @@ static int hwdom_fixup_p2m(paddr_t addr)
>           !is_memory_hole(_mfn(gfn), _mfn(gfn)) )
>          return -EPERM;
>  
> +    /*
> +     * Much like get_page_from_l1e() for PV Dom0 does, check that the page
> +     * accessed is actually an MMIO one: Either its MFN is out of range, or
> +     * it's owned by DOM_IO.
> +     */
> +    if ( mfn_valid(_mfn(gfn)) )
> +    {
> +        struct page_info *pg = mfn_to_page(_mfn(gfn));

We should consider introducing an mfn_t mfn = _mfn(gfn) local variable,
as there's a non-trivial number of _mfn(gfn) instances.  Not that you
need to do it here; I just noticed how many instances of it we have.
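
As a sketch of what such a refactoring might look like (purely
illustrative; the function already has an "mfn" local, as the reply
below notes, so a different, hypothetical name is used here):

    const mfn_t gmfn = _mfn(gfn);  /* hypothetical name, to avoid the existing "mfn" */

    if ( mfn_valid(gmfn) )
    {
        struct page_info *pg = mfn_to_page(gmfn);
        const struct domain *owner = page_get_owner_and_reference(pg);

        if ( owner )
            put_page(pg);
        if ( owner != dom_io )
            return -EPERM;
    }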

Thanks, Roger.
Re: [PATCH] x86/PVH: extend checking in hwdom_fixup_p2m()
Posted by Jan Beulich 3 months, 2 weeks ago
On 14.07.2025 17:00, Roger Pau Monné wrote:
> On Mon, Jul 07, 2025 at 04:44:25PM +0200, Jan Beulich wrote:
>> --- a/xen/arch/x86/hvm/emulate.c
>> +++ b/xen/arch/x86/hvm/emulate.c
>> @@ -184,6 +184,22 @@ static int hwdom_fixup_p2m(paddr_t addr)
>>           !is_memory_hole(_mfn(gfn), _mfn(gfn)) )
>>          return -EPERM;
>>  
>> +    /*
>> +     * Much like get_page_from_l1e() for PV Dom0 does, check that the page
>> +     * accessed is actually an MMIO one: Either its MFN is out of range, or
>> +     * it's owned by DOM_IO.
>> +     */
>> +    if ( mfn_valid(_mfn(gfn)) )
>> +    {
>> +        struct page_info *pg = mfn_to_page(_mfn(gfn));
> 
> We should consider introducing a mfn_t mfn = _mfn(gfn) local variable,
> as there's a non-trivial amount of _mfn(gfn) instances.  Not that you
> need to do it here, just noticed the amount of instances we have of
> it.

There already is an "mfn" local there, and I first used it here, but then
considered the result confusing.

Jan

Re: [PATCH] x86/PVH: extend checking in hwdom_fixup_p2m()
Posted by Jan Beulich 3 months, 3 weeks ago
On 07.07.2025 16:44, Jan Beulich wrote:
> We're generally striving to minimize behavioral differences between PV
> and PVH Dom0. Using (just?) is_memory_hole() in the PVH case looks quite
> a bit weaker to me, compared to the page ownership check done in the PV
> case. Extend checking accordingly.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> The addition may actually be suitable to replace the use of
> is_memory_hole() here. While dropping that would in particular extend
> coverage to E820_RESERVED regions, those are identity-mapped anyway
> (albeit oddly enough still by IOMMU code).

This really is more of a by-product of me looking into the failing (by
default) memory accesses that Dom0 is doing on systems I can put my hands
on. Those accesses are indeed occurring in the context of ACPI evaluating
e.g. _INI or _STA methods, and at least some of the underlying memory
regions also appear in /proc/iomem. Which means that in principle there
could be the option of Dom0 informing us of (at least some of) such
regions.

However, the accesses occur ahead of the kernel actually collecting
resource data. In particular, the intention would have been for the
to-be-added sub-op to be invoked from drivers/pnp/system.c:reserve_range().
That runs from an fs_initcall though, while the accesses occur from
underneath acpi_bus_init() and acpi_scan_init(), both called from
acpi_init(), which is a subsys_initcall. It also doesn't look as if
re-arranging things in Linux would be reasonably possible. Hence the only
(vague) option might be to duplicate some of what is already being done.
Yet besides this appearing undesirable to me, even if we hooked such
duplicate code up from acpi_arch_init(), that would only cover accesses
from underneath acpi_scan_init(); acpi_bus_init() runs yet earlier.
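
For illustration only, here's a hypothetical sketch of the intended
Linux-side hook, using the struct and sub-op number from the draft patch
reproduced below (the helper name, call site, and error handling are all
assumptions, and the public header addition would of course need
mirroring into Linux's copy of physdev.h; the sub-op is only a draft):

    /* Hypothetical hook, to be called e.g. from drivers/pnp/system.c:reserve_range(). */
    #include <xen/xen.h>
    #include <xen/interface/physdev.h>
    #include <asm/xen/hypercall.h>

    static void xen_report_sysmem(u64 start, u32 size)
    {
            struct xen_physdev_sysmem_op op = {
                    .addr  = start,
                    .size  = size,
                    .flags = 0,
            };

            if (!xen_initial_domain())
                    return;

            /* Best effort: failure merely means no 1:1 P2M entry gets inserted. */
            HYPERVISOR_physdev_op(PHYSDEVOP_system_memory, &op);
    }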

Nevertheless, for the sake of completeness I'll reproduce my draft patch
(not having the intended effect) below. For the moment, however, I think
I need to give up my hope that we could deal with the problem by anything
other than the "dom0=pf-fixup" approach.

Jan
---
Along the lines of the corresponding remark in "x86/PVH: extend checking
in hwdom_fixup_p2m()" submitted earlier, is_memory_hole() isn't used in
PHYSDEVOP_system_memory handling.

The new physdev-op is intended to be invoked from e.g. Linux'es
drivers/pnp/system.c:reserve_range().

Whether to make this a physdev-op or a platform-op wasn't quite clear
to me; the two have a pretty fuzzy boundary between them.

By observation, the size field doesn't need to be 64 bits wide. If we
still wanted to make it so, we'd definitely need to add preemption
checking, too. (We may need to do so anyway, I simply wasn't sure.)

There are sub-page regions being reported on the system I'm looking at.
Not sure what to do about them.

--- unstable.orig/xen/arch/x86/hvm/hypercall.c
+++ unstable/xen/arch/x86/hvm/hypercall.c
@@ -88,6 +88,7 @@ long hvm_physdev_op(int cmd, XEN_GUEST_H
     case PHYSDEVOP_pci_device_remove:
     case PHYSDEVOP_pci_device_reset:
     case PHYSDEVOP_dbgp_op:
+    case PHYSDEVOP_system_memory:
         if ( !is_hardware_domain(currd) )
             return -ENOSYS;
         break;
--- unstable.orig/xen/arch/x86/physdev.c
+++ unstable/xen/arch/x86/physdev.c
@@ -27,6 +27,8 @@ int physdev_unmap_pirq(struct domain *d,
 #ifndef COMPAT
 typedef long ret_t;
 
+#include <asm/altp2m.h>
+
 static int physdev_hvm_map_pirq(
     struct domain *d, int type, int *index, int *pirq)
 {
@@ -619,6 +621,71 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H
         break;
     }
 
+#ifndef COMPAT
+    case PHYSDEVOP_system_memory: {
+        struct xen_physdev_sysmem_op op;
+
+        if ( !is_hardware_domain(currd) )
+            ret = -EPERM;
+        else if ( copy_from_guest(&op, arg, 1) )
+            ret = -EFAULT;
+        else if ( !op.size )
+            ret = 0;
+        else if ( (op.addr | op.size) & (PAGE_SIZE - 1) ||
+                  (op.addr >> paddr_bits) ||
+                  ((op.addr + op.size - 1) >> paddr_bits) ||
+                  op.flags )
+            ret = -EINVAL;
+        else
+        {
+            mfn_t mfn = maddr_to_mfn(op.addr);
+
+            if ( mfn_valid(mfn) )
+            {
+                struct page_info *pg = mfn_to_page(mfn);
+                const struct domain *owner = page_get_owner_and_reference(pg);
+
+                if ( owner )
+                    put_page(pg);
+                if ( owner != dom_io )
+                    return -EINVAL;
+            }
+
+            /*
+             * Avoid the need to fix up behind a PVH Dom0, by inserting the
+             * desired mapping right away.  See also hwdom_fixup_p2m().
+             */
+            if ( !is_hvm_domain(currd) )
+                ret = 0;
+            else if ( !iomem_access_permitted(currd, mfn_x(mfn),
+                                              (mfn_x(mfn) +
+                                               PFN_DOWN(op.size) - 1)) ||
+                      altp2m_active(currd) )
+                ret = -EACCES;
+            else
+{ printk("sysmem: %08lx (%04x bytes; hole:%d)\n",//temp
+         op.addr, op.size, is_memory_hole(mfn, mfn_add(mfn, PFN_DOWN(op.size) - 1)));
+  ret = 0; break;//temp
+                do {
+                    p2m_type_t type;
+                    unsigned long gfn = mfn_x(mfn);
+                    mfn_t omfn = get_gfn(currd, gfn, &type);
+
+                    if ( !mfn_eq(omfn, INVALID_MFN) || !p2m_is_hole(type) )
+                        ret = mfn_eq(omfn, mfn) ? 0 : -ENOTEMPTY;
+                    else
+                        ret = set_mmio_p2m_entry(currd, _gfn(gfn), mfn, 0);
+                    put_gfn(currd, gfn);
+
+                    mfn = mfn_add(mfn, 1);
+                    op.size -= PAGE_SIZE;
+                } while ( !ret && op.size );
+}
+        }
+        break;
+    }
+#endif /* !COMPAT */
+
     default:
         ret = pci_physdev_op(cmd, arg);
         break;
--- unstable.orig/xen/include/public/physdev.h
+++ unstable/xen/include/public/physdev.h
@@ -344,6 +344,20 @@ typedef struct physdev_dbgp_op physdev_d
 DEFINE_XEN_GUEST_HANDLE(physdev_dbgp_op_t);
 
 /*
+ * Notify the hypervisor of a system memory range ("Motherboard resource"),
+ * some of which can only be discovered via interpreting ACPI's AML.  For PVH
+ * Dom0 such ranges, if deemed valid, would then have a 1:1 translation
+ * inserted into their P2M.
+ */
+#define PHYSDEVOP_system_memory         33
+struct xen_physdev_sysmem_op {
+    uint64_t addr;
+    uint32_t size;
+    /* No flags defined so far; field must be zero. */
+    uint32_t flags;
+};
+
+/*
  * Notify that some PIRQ-bound event channels have been unmasked.
  * ** This command is obsolete since interface version 0x00030202 and is **
  * ** unsupported by newer versions of Xen.                              **